Fine-grained Dynamic Resource Allocation
for Big-Data Applications
Luciano Baresi, Sam Guinea, Alberto Leva and Giovanni Quattrocchi
Abstract—Big-data applications are batch applications that exploit dedicated frameworks to perform massively parallel computations
across clusters of machines. The time needed to process the entirety of the inputs represents the application’s response time, which can
be subject to deadlines. Spark, probably the most famous incarnation of these frameworks today, allocates resources to applications
statically at the beginning of the execution and deviations are not managed: to meet the applications’ deadlines, resources must be
allocated carefully. This paper proposes an extension to Spark, called xSpark, that is able to allocate and redistribute resources to
applications dynamically to meet deadlines and cope with the execution of unanticipated applications. This work is based on two key
enablers: containers, to isolate Spark’s parallel executors and allow for the dynamic and fast allocation of resources, and control-theory to
govern resource allocation at runtime and obtain the precision and speed that are needed. Our evaluation shows that xSpark can (i)
allocate resources efficiently to execute single applications with respect to set deadlines and (ii) reduce deadline violations (w.r.t. Spark)
when executing multiple concurrent applications.
Index Terms—Distributed architectures; Control Theory; Quality assurance; Batch processing systems
1 INTRODUCTION
The resource management of web-, service-, and cloud-
based applications [1]–[6] has been studied for years.
In these cases resources are usually provisioned to meet
response times and/or throughput thresholds, and their
fulfillment typically depends on the intensity and variety of
incoming requests [7]. Big-data [8] applications are different:
they are batch computations that transform, aggregate, and
analyze (extremely) large amounts of data. Special-purpose
frameworks [9]–[11] slice these data and carry out their
analysis by means of parallel processes executed on a
distributed cluster of virtual (or physical) machines. Inputs
are provided once and for all at the start of a run, and a
single execution may take several minutes or hours. The
response time is defined as the time it takes to process the
entire set of inputs of a single application run, and resources
are provisioned to meet a deadline, i.e., a maximum threshold
for the response time [12]. Furthermore, big-data applications
that exploit the same framework, that is, the same cluster of
machines, compete for the resources made available by the
framework itself.
Spark [13] is the most widely used framework for
these applications. It is more flexible than Hadoop [9] and
can support more complex computations. Indeed, a Spark
program can embed multiple transformations and actions
by organizing them in a direct acyclic graph (DAG). Spark
allocates resources to applications statically, at the beginning
of their execution, and tends to use all the provisioned
resources. The only dynamism managed by Spark refers
to switching preallocated executors off or on; for example
when they remain idle for a user-defined amount of time,
or some tasks have to wait for too long (and idle executors
are available). In addition, the resources that are provisioned
to executors (e.g., CPU cores) cannot be changed. Scalability
is only horizontal and is based on simple time-outs and on
the availability of preallocated executors. This means that
the resources that are allocated to the applications —to meet
their defined deadlines— must be planned carefully, and that
runtime deviations are not managed. In addition, Spark can-
not (dynamically) redistribute the available resources among
the concurrent applications, nor among applications whose
concurrent executions were not planned at the beginning.
Literature shows that the static resource allocation prob-
lem for big-data applications is a hot research topic. For
example, [14], [15] propose solutions for the resource allo-
cation of Hadoop applications, [16], [17] introduce resource
optimization models for the a-priori allocation of resources
in Spark, [18] presents a performance model of Spark
applications. Runtime resource allocation, on the other hand,
has received limited attention so far.
The runtime resource allocation problem can be seen
as the capability to provision resources as needed, and not
just as initially planned. This is not a simple task, since the
provisioning will depend on multiple elements. Some will
be known beforehand (e.g., the size of the input data set),
some will be known after a profiling step (e.g., the nominal
performance of the system), while others will only be known
during the actual execution (e.g., performance, failures, and
the characteristics of other applications that are competing
for the cluster’s resources).
Dynamic resource provisioning becomes a means to make
different applications, which execute concurrently, share
resources efficiently. Dynamic provisioning would allow
for (i) redistributing resources to the different applications
as needed, and (ii) accommodating the execution of new
unforeseen applications, even when the resources they need
might have already been allocated to others.
This paper presents xSpark (extended Spark), an ex-
tended version of Spark that enables the fast and fine-grained,
dynamic provisioning of resources to single or multiple big-
data applications running on shared infrastructure. If we fo-
cus on a single application, xSpark allows one to dynamically
acquire and release resources to meet set deadlines precisely.
This way xSpark reduces the resources needed to execute an
application without over-provisioning, and helps take into
account contingent situations like failures and unexpected
delays. If we focus on multiple applications, it redistributes
provisioned resources and can also help minimize deadline
violations by supporting different scheduling policies
based on actual needs.
xSpark is based on two key enablers. On the one hand,
containers —a lightweight virtualization technology [19],
[20]— allow for the isolation of processing components
(executors) and for the fast and fine-grained provisioning
of resources (vertical scalability) [21], [22]. On the other
hand, control theory provides the machinery to govern the
allocation at run time with the required precision and speed.
xSpark starts with an initial approximate resource allocation, and then employs hierarchical, heuristic-based and control-theoretical planners to adjust the provisioned resources as the applications execute. Runtime controllers enact fast, fine-grained resource allocations: the CPU time of each Spark executor is allocated with a control period of just 0.5 seconds. Memory is also dynamically provisioned by exploiting off-heap allocation. Our evaluation shows that xSpark can allocate resources dynamically to allow applications to meet deadlines with an error that is always less than 2.5% when dealing with homogeneous data, and less than 5.5% with skewed data. It also demonstrates that xSpark can reduce deadline violations with respect to Spark when controlling multiple concurrent applications.
To summarize, the main contributions of this paper
are: (i) a container-based extension of Spark, called xSpark,
(ii) off-heap-based dynamic memory management, (iii) a
hierarchical planner for runtime resource allocation based
on a heuristic and a gray-box control-theoretical model, (iv)
an additional control-theoretical supervisor that oversees the
execution of multiple applications on the same framework,
and (v) a thorough evaluation in which we considered both
single and multiple applications concurrently.
The rest of the paper is structured as follows. Section 2
introduces Spark and the way it works. Section 3 presents
xSpark. Section 4 describes the heuristic used for the pre-
liminary resource allocation, while Section 5 focuses on the
control-theoretical solution used to adjust the initial plans
at runtime. Section 6 explains how xSpark can allocate
resources to control the execution of multiple concurrent
applications. Section 7 evaluates the proposed solution and
xSpark, Section 8 surveys related approaches, and Section 9
concludes the paper.
2 SPARK
Spark is deployed on a cluster of (virtual) machines and
employs a master/worker architecture. The Driver Program
(i.e., the big-data application to execute) starts by creating a
Spark Context that interacts with the Master Node to manage
the parallel computation. The Master Node is the manager
of the actual computing resources, which are called Worker
Nodes. Each worker node is installed on a dedicated machine
and contains an Executor that runs for the entire lifetime
of the application. The executor performs multiple tasks in
parallel using a thread pool, and the number of parallelized
tasks depends on the number of CPU cores it has been given.
Different applications may share the same cluster. Each
would have its own Spark context; master and worker nodes
would be shared, but executors would be assigned to their
respective applications before execution. Tasks can persist
the results of their computations on a distributed storage
layer (e.g., HDFS [23]), hosted on the Spark cluster, or on
dedicated machines.
To fully comprehend Spark one must understand driver
programs. The simple Python program of Figure 1 analyzes
a text file in which each line has the following structure: class:word, where word is an English word of a specific class: verb, adjective, noun, etc. The goal is to identify the words that are not verbs, and that share the same first and last letters. Line 2 creates context sc by providing the URL (local, in this case) of the master node and the name of the application (example). The statement starting at line 3 comprises 6 Spark operations. Besides reading from file dataset.txt (line 3), it splits each line in the original data set into two parts: word and actual class (lines 4 and 5). It then proceeds to create lists of words that share the same class (line 6), and eliminates the list of verbs (line 7). The remaining lists are flattened to create x, that is, the list that contains all the words that are not verbs. The statements at lines 9 and 11 use x to create y and z. They are both lists of tuples of the form (c, words), where c is the first/last character of the words. Finally, line 13 computes the result by performing the cartesian product of y and z to obtain a list of tuples of the form ((c_f, words_f), (c_l, words_l)), where f and l stand for the first and last characters, respectively. It then proceeds to perform the set intersection of words_f and words_l to find all the words that start and end with the same letters. The result is collected at line 15.
Spark operations manipulate RDDs (Resilient Distributed Datasets). An RDD is an immutable and fault-tolerant collection of records that is split into multiple redundant partitions to facilitate parallel computation.
1  from pyspark import SparkContext
2  sc = SparkContext('local', 'example')
3  x = sc.textFile("dataset.txt")
4       .map(lambda v: v.split(":"))
5       .map(lambda v: (v[0], [v[1]]))
6       .reduceByKey(lambda v1, v2: v1 + v2)
7       .filter(lambda (k, v): k != "verb")
8       .flatMap(lambda (k, v): v)
9  y = x.map(lambda x: (x[0], x)).aggregateByKey(list(),
10        lambda v1, v2: v1 + [v2], lambda v1, v2: v1 + v2)
11 z = x.map(lambda x: (x[-1], x)).aggregateByKey(list(),
12        lambda v1, v2: v1 + [v2], lambda v1, v2: v1 + v2)
13 result = y.cartesian(z).map(lambda ((k1, v1), (k2, v2)):
14        ((k1 + k2), list(set(v1) & set(v2))))
15 result = result.collect()
Figure 1: Example Spark code.
[Figure: Job 0 as a DAG of four stages — stage 0: textFile, map, map; stage 1: reduceByKey, filter, flatMap, map; stage 2: reduceByKey, filter, flatMap, map; stage 3: aggregateByKey, aggregateByKey, cartesian, map]
Figure 2: The DAG of our example application.
Operations can be of two kinds: transformations create new RDDs (e.g., map, filter, etc.), while actions perform computations that generate values (e.g., count, first, collect, etc.). Spark treats the former lazily, that is, it chains them together for optimization purposes, and only truly performs them when an action is encountered. This makes Spark particularly efficient when executing iterative algorithms (e.g., machine learning applications or computations on graphs).
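To make the lazy-evaluation point concrete, here is a minimal sketch (assuming an already-created SparkContext sc and the dataset of Figure 1):

    # Transformations only extend the DAG; nothing is executed yet.
    rdd = sc.textFile("dataset.txt").map(lambda v: v.split(":"))
    # The action is what actually triggers the distributed computation.
    n = rdd.count()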
Spark starts executing a program by identifying jobs,
delimited by the presence of actions in the code, and stages
(within jobs), delimited by operations that require data to
be shuffled (i.e., moved among executors), thus breaking
locality. Indeed, Spark distinguishes between narrow and
wide transformations specifically for this purpose; the former
do not shuffle data (e.g., map, filter, etc.), while the latter do
(e.g., reduceByKey, etc.). Spark “identifies” all the operations
that are to be executed, up to the first action, and materializes
them as a directed acyclic graph (DAG)¹. The DAG defines the execution order among stages, and the extent to
which stages can be executed in parallel. Note that a task
performs all the operations of a particular stage on a partition
of the input RDD of that stage. As previously stated these
tasks are executed in parallel, depending on the number of
available executors and on the number of CPU cores assigned
to each executor.
Figure 2 shows the single job that is created from the example code (there is only one action, i.e., the final collect) and its four stages. Stage 0 comprises the operations from line 3 to 5, and ends with the reshuffling operation reduceByKey of line 6. Since the statements at lines 9 and 11 use the RDD generated by this last operation (they both exploit x), and RDDs are immutable, they could evolve in parallel, but the parallelism starts after reshuffling, that is, just after the end of the stage. This is why the operations at lines 6, 7, and 8 are duplicated and are part of both Stage 1 and Stage 2, which also comprise the specific maps that help create y and z. Since the aggregateByKey operations at lines 9 and 11 require data shuffling, they cannot be part of the parallel stages; they initiate Stage 3. This last stage aggregates the data from the two preceding stages, computes the cartesian product, and applies the final map transformation. The intersection of the two sets is plain Python and thus is not part of the DAG.
1. The DAG is not the control-flow graph of the job's code. It does not contain branches and loops since Spark has already resolved them.
3 XSPARK
To extend Spark with dynamic resource provisioning, xSpark² modifies both its architecture and processing model.
xSpark focuses on stages. Instead of considering complete
applications, xSpark reasons on per-stage deadlines as a
means to decompose the overall execution time. Since
stages are composed of diverse operations with different
degrees of complexity, they must be modeled and controlled
individually. xSpark creates dedicated executors for each
stage, instead of general-purpose executors that can execute
the tasks of any stage, as Spark would normally do. This
way, the resources that are (dynamically) provisioned to a
given executor will only impact the performance of the stage
that it is associated with. This gives xSpark a fine-grained
control of the execution of the different stages, and thus of
the whole application. Moreover, when a stage is submitted
for execution, one executor per worker node is created and
bound to that stage. This allows xSpark to equally distribute
the computation and the data among the whole cluster.
xSpark uses containers (Docker³) to isolate multiple executors running on the same worker node, and to allocate CPU cores and memory (using Linux cgroups [24]). In particular, xSpark uses CPU quotas to allocate small fractions of cores⁴ to containers/executors. Spark statically preallocates the memory that is associated with each executor, and thus with each application. xSpark, on the other hand, distinguishes between heap memory, which can only be allocated statically, and off-heap memory, which can be managed dynamically; a small portion of heap memory is reserved for each application, while the remaining memory is assigned dynamically through off-heap memory.
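As an illustration of the underlying mechanism, the following sketch shows how a fraction of a core can be allotted to a running container through cgroups CPU quotas; it uses the docker-py SDK and a hypothetical container name, and is not xSpark's actual code:

    import docker

    client = docker.from_env()
    container = client.containers.get("executor-stage-0")  # hypothetical name
    # Allot 0.5 cores: the quota is the share of each 100 ms CPU period
    # that the container may consume.
    container.update(cpu_period=100000, cpu_quota=50000)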
To operate, xSpark requires an appropriately annotated
DAG of the entire application. For each stage, the DAG must
contain: the execution time, that is, the stage’s duration, the
number of identified tasks per stage, the number of input
(read) records, the number of output (written) records, and
the nominal rate, i.e., the number of records that a core
processes in a second. These numbers can come from an
initial profiling execution, which is often a viable solution
since Spark applications are usually not executed only once,
and tend to be long-lasting assets. Annotated DAGs are stored
in the Annotated Application DAG Repository and are retrieved
as soon as an application is submitted for execution. Note that
xSpark does not require that provided annotations be precise
with respect to the used input data since they only serve as
an initial assumption; xSpark will cope with imprecisions
and changes at runtime.
3.1 Architecture
Figure 3 shows xSpark’s architecture. Gray boxes represent
components that we added to Spark, while dark-gray boxes
correspond to control components. White boxes represent
the existing components that we modified.
xSpark is based on four controllers. A single, centralized
Memory Manager (Section 3.2) is deployed on the master
node; it manages how memory is shared among running
2. The source code of xSpark is available at https://github.com/deib-polimi/xSpark
3. https://www.docker.com
4. Quotas assign a specific quota of the CPU period to a container.
[Figure: the Master Node hosts the Memory Manager, the Annotated Application DAG Repository and, per application, a Context with Monitor, Planner, Control Enactor, Stage Scheduler and Task Scheduler; each Worker Node hosts a Container Manager, a Supervisor and, per stage, a containerized Executor with its Controller]
Figure 3: Architecture of xSpark.
applications. As previously stated it considers both heap
memory, which cannot be varied at runtime, and off-heap
memory, which can be added and removed on demand.
Each application is controlled by a heuristic-based Plan-
ner (Section 4), implemented within the master node, that
oversees the execution of the different stages of the applica-
tion. For each stage, it computes a deadline, calculates the
amount of CPU cores needed to meet it, and assigns them to
the executors allocated to the stage. Context Monitor oversees
the life-cycle of the stages scheduled by Stage Scheduler,
while Control Enactor creates a containerized executor for
the submitted stage on each worker node, with the memory
limits imposed by Memory Manager.
Unfortunately, many factors can influence the actual
performance and invalidate the estimation: for example, the
quantity of filtered-out records, available memory, number
of used nodes, size of storage layer, etc. This is why each
executor is equipped with a Local Controller (Section 5),
based on control theory; it is used to fix these imprecisions by
dynamically modifying the amount of CPU cores assigned to
the executor. These controllers interpret the estimated dead-
lines from Planner as desired durations or set-points, and
aim to allocate just the right amount of resources (i.e., ideally,
the minimum amount) to meet the local deadline. xSpark
requires that all the executors working on the same stage
process the same number of tasks to avoid synchronizing the
controllers that are dedicated to the same stage since they
work on the same deadline and data quantity. By combining
these lightweight controllers and the fast vertical scalability
of containers, xSpark is able to achieve control periods of less
than a second.
The executors created for different applications can be
executed simultaneously and share the resources of the same
worker nodes: for example, Figure 3 shows the case of two
applications that share the same cluster. Moreover, new
applications could be submitted for execution while others
are already running and may saturate the cluster. To cope
with these issues an additional Supervisor (Section 6) is
added to each worker node to oversee the local controllers
and solve possible resource contentions among the different
applications⁵. Different policies are supported.
Supervisors also allow xSpark to speed up the compu-
tation of stages when the load is low by allocating more
resources than the ones strictly needed to satisfy the desired
progress rates (e.g., the cores computed by the local con-
trollers). The speed up mechanism is entirely configurable.
3.2 Memory Management
At application-submission time, Spark requires users to
specify the amount of memory to dedicate to each executor.
When dealing with a single application, choosing this value
is simple. In general, picking a larger value will reduce the
probability of disk swap and improve performance. However,
we still need to pay attention to the system load, because if
the heap tries to grow, and there is not enough free memory
in the system, the executor (i.e., the JVM) will crash.
Choosing the right amount of memory for each executor
when dealing with multiple applications is more complex.
Hypothetically, if one could know in advance the number
of applications being submitted to the cluster, s/he could
partition the available memory among them. Since this is not
always feasible, one can either under-provision the running
applications to save memory for possible future applications,
or allocate the memory dynamically.
Unfortunately one cannot resize the heap memory of a
JVM at run time: one would need to kill the process and
restart it with a new configuration. Knowing that Spark
postpones the launch of an application if the requested
allocation of memory is not satisfiable, we need to pay
attention when choosing the application’s memory value.
Assigning a high amount of memory would cause application
executions to be serialized, while deciding for a lower value
would allow for a higher number of applications in parallel,
at the price of an increased risk of disk swapping.
One possible solution is to use off-heap memory to make
memory boundaries flexible; even if the best performance
is obtained when operating with on-heap memory, Spark
can use off-heap allocation both for execution and storage.
Off-heap memory refers to objects that are managed directly
by the operating system and stored outside the process’ heap,
that is, they are not processed by the JVM’s garbage collector.
Accessing off-heap data is slightly slower than accessing
on-heap data, but it is faster than reading from and writing
on a disk [26].
Since Spark does not provide a way to dynamically
resize off-heap memory, xSpark adds the Memory Manager
component. xSpark launches an application with a relatively
small amount of heap memory and dynamically resizes
the provisioned off-heap memory according to the number
of applications running at any given time. Given a set of
running applications A, and the total memory of the cluster M, Memory Manager uses the following strategy:
- The quantity of heap memory h allocated to an application is fixed and configurable by the user;
- Given the allocated heap memory h · |A|, the remaining portion O = M − h · |A| is equally distributed to the running applications through the off-heap memory mechanism, meaning that each application is set to use at maximum o = O/|A| off-heap memory in addition to h;
- When a new application is submitted for execution, the total off-heap memory becomes O′ = O − h and each application reduces its off-heap quota (o′ = O′/(|A| + 1));
- When an application terminates, it frees its heap and off-heap memory (O″ = O′ + h), and the other applications increase their off-heap quotas to o″ = O″/(|A| − 1).

5. Recall that each worker node manages the same applications and thus contentions can be resolved in a distributed way.
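The rules above boil down to a simple computation; a minimal sketch, with our own names and assuming memory expressed in GB:

    # Off-heap quota per application, following the strategy above.
    def offheap_quota(M, h, num_apps):
        # h * num_apps GB of heap are reserved statically;
        # the remainder is split equally as off-heap memory.
        return (M - h * num_apps) / num_apps

    # Example (hypothetical values): 112 GB of cluster memory, 2 GB of heap
    # per application, 4 running applications -> 26 GB of off-heap each.
    quota = offheap_quota(112, 2, 4)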
4 PLANNER
Each application uses a heuristic-based planner to compute
per-stage deadlines. The only input is the foreseen deadline
(execution time) for the entire application. When a stage s ∈ S — where S is the set of the application's stages — is submitted for execution, xSpark computes a preliminary deadline using the following formula:

    deadline_s = α · (appDeadline − spentTime) / w_s    (1)

where spentTime is the time already spent on executing the application, appDeadline is the deadline submitted by the user, and α is a constant, between 0 and 1, that xSpark uses to set the level of conservativeness foreseen to respect the deadline. If α = 1, the execution time will be controlled (by the lower control levels) to meet the deadline exactly, while with smaller values for α the control will be more conservative and allow for possible delays due to imprecisions in the control⁶. w_s, the weight of the stage, is computed as follows:

    w_s = β · |R_s| + (1 − β) · d_s    (2)

where R_s is the set of the stages still to be executed, s included, and d_s is the ratio between the sum of the profiled durations of the stages in R_s and the profiled duration of s itself. Hence:

    d_s = (Σ_{r ∈ R_s} duration_r) / duration_s    (3)
Since the performance measured during profiling may be
different from the one seen at run time, we mitigate the
heuristic by means of constant β. To find a proper value for β, we performed a sensitivity analysis (see Figure 4) using four benchmark applications that we also used to evaluate xSpark (Section 7).
We defined the error on the application deadline (a
negative value implies a violation) as:
    ϵ_a = 100 · (appDeadline − appDuration) / appDeadline %    (4)

and the error on a stage deadline as

    ϵ_s = 100 · |stageDeadline − stageDuration| / appDeadline %    (5)

6. Note that when the application deadline is missed, the heuristic computes negative or null stage deadlines, and the estimated number of cores will be equal to the number of available ones.
[Figure: deadline errors (0 to 5%) for β between 0 and 1, on aggr-by-key, sort-by-key, PageRank, and KMeans; (a) impact on ϵ_a, (b) impact on avg(ϵ_s)]
Figure 4: Sensitivity analysis for beta.
where appDeadline (stageDeadline) is the desired execution time of the application (stage), and appDuration (stageDuration) represents the actual execution time. Note that Equation 4 does not use the absolute value, to distinguish between delays and early terminations, and Equation 5 uses the entire application's deadline, and not the stage's deadline itself, as the denominator, to have a more relevant indicator of the control error. We also define avg(ϵ_s) and std(ϵ_s) as the average and the standard deviation of ϵ_s over all the stages of an execution of an application.
Figure 4 shows how ϵ_a and avg(ϵ_s) change with respect to different values of β (ranging between 0 and 1). Even though we could not identify an optimal value for all the applications we used, the values between 0.2 and 0.4 were reasonably adequate, and thus we set β to 0.3. This way we avoid over-fitting profiling data, which is what would happen if we only used d_s. Note that all w_s are computed at run time, since the order of stage execution cannot be known a-priori if there are parallel threads in the DAG. The number of CPU cores to be allocated for executing the stage in a time that is equal to the computed stage deadline (i.e., the minimum number of cores that avoids a deadline violation) is then estimated as:
    estdCores_s = ⌈inputRecords_s / (deadline_s · nomRate_s)⌉    (6)

where inputRecords_s is the number of records that must be processed by s, and nomRate_s provides the nominal rate of s, i.e., the number of records processed by a single core per second (obtained during profiling). inputRecords_s depends on the sum of the data written by the parent nodes in the DAG (i.e., the sum of records produced by the parents).
The final step computes the initial number of cores and the number of tasks to be processed by each executor. xSpark distributes the load equally amongst all the available workers by creating one executor per stage per worker. This way each executor holds, and thus remotely reads during shuffles, the same amount of data, meaning that xSpark can compute the same deadline for all the executors (as done by Equation 1). The initial number of cores per executor is computed as follows:

    intlCoresExec_s = ⌈estdCores_s / (cq · numExecutors_s)⌉ · cq    (7)

where numExecutors_s is the number of executors (which is always equal to the number of workers), and cq stands for core quantum, a constant that defines the quantization applied to the resource allocation. The smaller the value, the more precise the allocation is. In our experiments we set this constant to 0.05 to allocate cores with a precision of up to 0.05 cores and obtain a low error.
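Putting Equations 1, 2, 6 and 7 together, the planner's per-stage computation can be sketched as follows (names and structure are ours, not the actual xSpark code):

    import math

    def plan_stage(app_deadline, spent_time, alpha, beta, durations,
                   s, remaining, input_records, nom_rate, cq, num_executors):
        # durations: profiled duration of each stage;
        # remaining: stages still to be executed, s included.
        d_s = sum(durations[r] for r in remaining) / durations[s]   # Eq. 3
        w_s = beta * len(remaining) + (1 - beta) * d_s              # Eq. 2
        deadline_s = alpha * (app_deadline - spent_time) / w_s      # Eq. 1
        estd = math.ceil(input_records / (deadline_s * nom_rate))   # Eq. 6
        # Quantized initial per-executor allocation (Eq. 7).
        per_exec = math.ceil(estd / (cq * num_executors)) * cq
        return deadline_s, per_exec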
This is only the foreseen, initial core allocation for each
local controller, which may continuously change it at runtime.
The tasks are then distributed equally (excluding remainders)
to the different executors. Since not all deadlines are feasible, given the provided resources and input datasets, xSpark conducts a feasibility check before starting any execution, by means of the aforementioned heuristic, and warns the user if the deadline cannot be met. We acknowledge that more
sophisticated and precise approaches exist (e.g., [27], [28]),
but they are on average too slow for a preliminary check.
5 LOCAL CONTROLLERS
Each containerized executor has its own local controller
to fulfill a per-stage deadline in the face of exogenous
disturbances, by dynamically allocating CPU cores. The cen-
tralized heuristic-based control loop determines the desired
duration, the maximum and the initial cores to allot for
each executor and the number of tasks to be processed prior
to the execution of a stage. The local controllers will then
continuously adjust the numbers of cores allocated to the
executors, according to the work already accomplished and
to the desired completion rate.
Executors that are dedicated to different stages are
independent, and therefore so are their controllers. Executors
that work in parallel on the same stage must complete the
same amount of work (i.e., number of tasks) in the same
desired time; this means that their local controllers are also
independent and do not need to communicate with one
another. Given their decentralized nature, we concentrate
on the design of a single controller, while the interactions of
controllers with their respective Supervisors are extensively
detailed in Section 6.
5.1 Controlled System
To derive a model of the controlled system we started with
the following two assumptions.
Assumption 1: at steady-state, with constant resource allocation and disturbances, the progress rate is also constant, and is a function — f(·), to name it — of the allocated resources and of the disturbances. This function is generally non-linear, but regular enough for the values of interest of the involved quantities.
This assumption is technically required to express the
mathematical model. Later in this section we will show
how the resulting algorithm can perfectly cope with time-
varying disturbances and allocations. For completeness
we also notice that progress rates may also depend on
available memory. However, memory is either sufficient, in which case allocating more is useless, or it is not. In the latter case there is a performance degradation, but this depends on fine-grained effects connected to cache, swap, and so on, which are best viewed as disturbances at this level.
Memory management is especially critical when dealing
with multiple applications in parallel. A description on
how xSpark dynamically provisions memory is provided
in Section 3.2.
Assumption 2: the output of f(·) exerts its effect through an asymptotically stable, linear, time-invariant dynamic system with unity gain and relative degree one.
This assumption is met if the reaction of the controlled system to a modification in the resource allocation is faster than the control period. In the literature the control period is usually set in the order of minutes, but the use of containers, which can be resized in hundreds of milliseconds, allows for a control period of a second or less, while preserving the hypothesis above. In the absence of significant actuation delays, thanks to containers, we can safely suppose that the system's relative order is one, but we cannot know anything about its dynamic structure. Therefore, we assume the simplest possible form with one pole and no zeros, that is:
    ρ(k) = p · ρ(k−1) + (1 − p) · f(c(k−1))    (8)
where ρ is the progress rate and p the pole. Having p in the [0, 1) range ensures that the control system is asymptotically stable (for that, |p| < 1 would be sufficient); the choice to limit p to positive values further ensures that the reference completion rate will be tracked without oscillations, i.e., with good regularity.
The model is affected by bounded uncertainty, but is affine and time-invariant. It belongs to the Hammerstein class, which opens up interesting generalizations. There is a lot of literature on the identification of Hammerstein models, from works like [29] to the time-varying case, which could be useful in the future to generalize our results [30]–[32]. To date, we obtain the bounds of f(·) through profiling, while p in (8) is estimated using step response analysis.
5.2 Control Synthesis
The progress set point is chosen based on the desired
completion time, which is received from the upper control
layer, as illustrated in Figure 5. t_co is the desired completion time (the per-stage deadline minus the time at the start of the elaboration) and ϕ ∈ (0, 1] is a configuration parameter that determines the extent to which the control will attempt to complete the execution earlier than the deadline, as a safety measure (like α in Equation 1, but for stage deadlines).
Figure 5: Set point generation for a local controller.
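Under our reading of Figure 5, the prescribed progress is simply a ramp that reaches 100% after ϕ · t_co seconds; the following hypothetical sketch makes this concrete (names are ours):

    def set_point(t, t_co, phi):
        # Prescribed completion percentage at time t since the stage start.
        return min(100.0, 100.0 * t / (phi * t_co))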
To design the control law, we denote by q the control timestep — i.e., the time between two calculations of c(k) — and express the dynamics between c(k) and the accomplished completion percent a_%(k); for the data e(k) elaborated up to step k (the rate between k−1 and k is decided at k−1) we get:

    e(k) = e(k−1) + q · ρ(k−1)    (9)
so that, indicating with D_o the total amount of data to process, a_%(k) evolves as:

    a_%(k) = 100 · e(k)/D_o = a_%(k−1) + (100 q/D_o) · ρ(k−1)    (10)

In z transfer function form, (8) and (10) respectively read:

    ρ(z)/u(z) = (1 − p)/(z − p),    a_%(z)/ρ(z) = (100 q/D_o)/(z − 1)    (11)

where u(k) = f(c(k)). The transfer function from u to a_%, that is, the linear part of the dynamics seen by the controller, is thus:

    a_%(z)/u(z) = (100 q (1 − p)/D_o) / ((z − 1)(z − p))    (12)
To track the ramp set point shown in Figure 5, the loop transfer function must have two poles in z = 1, which can be achieved by a PI (Proportional plus Integral) controller of the form:

    C(z) = K · (z − a)/(z − 1)    (13)
Since specifications in terms of durations are given in time units (e.g., seconds) and not as numbers of sampling periods, it is convenient to re-interpret the control loop in the continuous time. To do that we back-apply the forward difference method and obtain the s transfer function:

    P_c(s) = μ_P / (s (1 + s τ_P)),    μ_P = 100 K_f / D_o,    τ_P = q / (1 − p)    (14)

whereas for the controller we have:

    C_c(s) = μ_C (1 + s τ_C) / s,    μ_C = K (1 − a) / q,    τ_C = q / (1 − a)    (15)
[Figure: magnitude Bode plot |L(jω)| with the controller zero at 1/τ_C = 1/(η τ_P), the process pole at 1/τ_P, the cutoff frequency ω_c between them, and the K_f/K_f,nom uncertainty shifting the curve]
Figure 6: Bode magnitude diagram of the required open-loop
frequency response.
We need to force the magnitude Bode diagram of the open-loop frequency response L(jω) = C_c(jω) P_c(jω) to behave like in Figure 6. We substitute f(·) with a (positive, bounded, unknown) gain K_f, for which we assume a nominal value K_f,nom, and make the controller time constant τ_C proportional by η > 1 to τ_P, selecting the controller gain so that the cutoff frequency — for K_f = K_f,nom — is the logarithmic mean of τ_C and τ_P. This provides the nominal cutoff frequency ω_c and phase margin φ_m as:

    ω_c = 1/(√η · τ_P),    φ_m = arctan(√η) − arctan(1/√η)    (16)
Higher values of η yield higher margins but also a lower-frequency zero a in controller (13); this in turn would result in control kicks in the presence of step-like set point variations — which do not occur according to Figure 5 — and in longer disturbance recovery times. Omitting lengthy computations, once the continuous-time controller is tuned this way and converted back to discrete time, we get:

    K = D_o (1 − p) / (100 q K_f √η),    a = (η + p − 1)/η    (17)
to put into (13). With a value of η around ten — corresponding to φ_m around 55° — as a reasonable default, the controller behaves satisfactorily provided that q is small enough with respect to the required completion time, which is plainly a matter of reasonableness. The discrete-time controller in state space form reads:

    x_C(k) = x_C(k−1) + (1 − a) · (a°_%(k−1) − a_%(k−1))
    rc(k) = K · x_C(k) + K · (a°_%(k) − a_%(k))    (18)
where a°_%(k) is the prescribed progress percentage and rc(k) the computed core allocation at each control step k. As a final remark, in a real application it may transiently happen that the controller computes a negative rc(k), or one exceeding the number of available CPU cores. Denoting by c_min and c_max the minimum and maximum number of available cores in the worker, rc(k) needs to be clamped within the two, and the state of (18) has to be recomputed to maintain consistency with the input and output. This is done by:

    x_C(k) = c(k)/K − a°_%(k) + a_%(k)    (19)
Note that rc stands for requested cores, since this value is read and possibly modified by the worker's Supervisor, as detailed in Section 6.1.
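For concreteness, Equations 18 and 19 translate into a few lines of code; the sketch below is a minimal transcription with our own names, not the actual xSpark controller:

    class LocalController:
        def __init__(self, K, a, c_min, c_max):
            self.K, self.a = K, a
            self.c_min, self.c_max = c_min, c_max
            self.x = 0.0          # controller state x_C
            self.err_prev = 0.0   # a°_%(k-1) - a_%(k-1)

        def step(self, setpoint, progress):
            # State update and control law of Equation 18.
            self.x += (1 - self.a) * self.err_prev
            err = setpoint - progress
            rc = self.K * (self.x + err)
            # Clamp, then back-calculate the state (Equation 19).
            c = min(max(rc, self.c_min), self.c_max)
            if c != rc:
                self.x = c / self.K - err
            self.err_prev = err
            return c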
Finally, we need to assess a couple of important properties of the control system. The asymptotic stability refers to nominal conditions, i.e., when the model describes the controlled system exactly. With the controller as defined in Equation 18, phase margin considerations allow us to state that the closed-loop system is guaranteed to be asymptotically stable, provided that the reduction of the said margin due to the sampling and holding is not excessive. To this end, one could set a maximum φ_m reduction δφ_m, and then bound q to the upper limit 2δφ_m/(3ω_c), with δφ_m in radians, as usually done in digital controls.

As for the nominal performance, the nominal response time τ_r of the closed loop is the inverse of the cutoff frequency, i.e.,

    τ_r = √η · τ_P    (20)
The time to reach a new set point, or to recover from a step-like disturbance like the abrupt unavailability of a core — provided the goal is still attainable — can be quantified as 5 times τ_r. This is fine because τ_P is the time constant used by the step-by-step resource allocation to act on the processing speed; thus, in a well sized system, it is small with respect to the control-relevant time scale.
Finally, as for applicability caveats, we need to distinguish
between model errors due to mismatch and variability. In
the former case the controlled system does not change its
behavior, but the model does not represent it exactly. In the
latter, the model may even start out as a perfect replica of the system, but the system may later change its behavior. In our case
variability should be scarcely relevant, as the typical task
consists of elaborating a huge amount of data, but the single
operation is simple and self-similar. If this assumption is
heavily violated, however, the applicability of our solution
may be questioned.
As for mismatch, the main point is whether or not the
structural assumptions made on the model are reasonable.
Again, this can be assessed by off-line profiling, or through
simulation if a reliable enough model of the applications
being considered is available. In this respect it is difficult
to make general statements on the applicability of our
technique, except that once the convenient off-line testing is
carried out, no post-deployment issues are to be expected.
6 SUPERVISORS
When dealing with multiple applications running at the same
time we need to take into account two possible cases: either
we know the workload (i.e., the applications that will be
executed in parallel) in advance, or we do not.
In the former case, one can statically allocate resources so
that all applications have enough (or at least some) resources.
This a-priori allocation could be wrong or sub-optimal, for
example, if one application does not really need all the
resources that it is given or it needs more. While in the former
case we only have a waste of resources, in the latter case
we might witness under-performance and missed deadlines.
Furthermore, if all resources are pre-allocated there is no
room for unplanned applications that may be submitted for
execution while the others are already running.
As previously stated, xSpark adds a new hierarchical
control layer, implemented by multiple distributed Supervi-
sors to deal with these scenarios. This is needed to oversee
how the executors manage concurrent applications when the
resources provided by the cluster are not enough to allow
each application to meet its deadline.
6.1 Controlling Local Controllers
We deploy a Supervisor to each worker node; it is responsible
for managing the resource demands of executors (local
controllers) running on that machine. Every local controller
continues to autonomously determine the CPU cores needed
to follow its application’s desired progress rate, but the
Supervisor can decide to modify this value according to the
state of the resources.
Local controllers that are deployed to the same machine are synchronized and have the same control period. The Supervisor collects resource allocation requests in vector r̄c (requested cores), where rc_i is the specific request made by application i, and produces a new core allocation as vector c̄c (computed cores):

    c̄c = γ · āc + (1 − γ) · r̄c    (21)
where āc is an allocation vector that uses all available resources. Parameter γ ∈ [0, 1] allows us to boost the execution speed. If γ = 1, the allocation saturates all resources, and applications will be executed faster than planned. If γ = 0, the allocation only considers the resources requested by the local controllers; this way we keep the cluster's utilization low and we save resources. The different strategies that can be adopted to compute āc are described in Section 6.2.

[Figure: the requests of two applications on a plane; the line ac_1 + ac_2 = 8 separates the safe area from the contention area]
Figure 7: Example of two resource requests made by two applications running in a shared environment; the maximum number of allocatable cores is 8.
Since applications are not aware of one another, the Supervisor needs to check whether there is resource contention. If the total amount of requested cores (cr), that is, the sum of all rc_i, is less than (or equal to) the cluster's available resources (c_max), there is no contention. If cr is greater than c_max, we have contention and the requests cannot all be satisfied: we must correct the amount of resources to attempt to manage the contention. The two cases can be easily visualized by considering two applications on a two-dimensional plane (see Figure 7). On the two axes we have the resources requested by the two applications. In this example, the maximum amount of resources that can be allocated consists of 8 CPU cores. The white dot represents a feasible allocation, because the sum of the requested cores (3 + 2 = 5) is lower than 8. The black dot, on the other hand, represents an infeasible allocation, since the sum (4 + 7 = 11) is greater than 8. The region above the thick line represents all the combinations of resource requests that cause contention. In this latter case, we need to find a new pair ac_1 and ac_2, such that ac_1 + ac_2 = c_max, where ac_i is the value in vector āc that corresponds to the i-th application.

Finally, since the number of CPU cores that the executor will acquire might not be what the executor controller calculated, we also need to update the state of the local controllers to maintain consistency; otherwise the next control operations would be based on an incorrect previous state.
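For concreteness, a sketch of the supervisor's per-period correction (Equation 21 plus the contention check). The exact redistribution under contention depends on the strategy in use (Section 6.2); here we simply rescale āc so that it sums to c_max, as an illustrative assumption:

    def supervise(rc, ac, gamma, c_max):
        # rc: cores requested by the local controllers; ac: an allocation
        # vector (strategy-dependent) that uses all available resources.
        cc = [gamma * a + (1 - gamma) * r for a, r in zip(ac, rc)]
        if sum(rc) <= c_max:
            return cc  # no contention: the requests are feasible
        # Contention: fall back to a corrected allocation summing to c_max.
        scale = c_max / sum(ac)
        return [a * scale for a in ac]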
6.2 Strategies for Core Allocation
Multiple strategies, which take into account the different static and/or dynamic characteristics of the involved applications (e.g., deadlines or nominal rates), can be adopted when distributing the available cores. The Supervisor can use these strategies (i.e., how āc is computed) to resolve contention and to speed up the computation (i.e., to set γ). Note that γ is part of the initial configuration, and thus is set before starting any computation (and before knowing of any possible contention).

We advocate that one can draw a parallel between the problem of allocating resources to xSpark executors and the preemptive online scheduling of sporadic tasks, with
arbitrary deadlines, in a real-time multiprocessor system.
In this latter case, it has been proven [33]–[35] that no
optimal on-line scheduling algorithm exists for sporadic task
sets with constrained or arbitrary deadlines. Therefore, we
decided to focus on sub-optimal approaches.
One of the most popular dynamic-priority planning-
based on-line algorithms is called Earliest Deadline First
(EDF) [36]; it uses deadlines to determine the priorities of
tasks. We can adopt this same strategy and use execution
deadlines to determine priorities when allocating resources
for multiple applications, but we can also build more sophis-
ticated strategies. Table 1 compares the different strategies
we have devised, using three example applications, under
the assumption that only 16 cores are available.
Earliest Deadline First "All" (EDF_all). This strategy allocates all the available resources to the application with the nearest upcoming deadline, even if the application requested fewer resources. In general, this approach completes the execution of a single application before allocating resources to the next one, and completes the applications in a certain order, as defined by the proximity of their deadlines. The algorithm requires the maximum number of allocatable cores (c_max), and the time to complete (ttc_i) and remaining tasks (rt_i) of each application.
Table 1 shows that the only application that actually ac-
quires resources is the one with the smallest time to complete
(deadline). In this case, we have made the assumption that
the tasks each application still has to execute can use all the
cluster’s cores.
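A minimal transcription of this policy (our own code, using ttc_i as the measure of deadline proximity):

    def edf_all(ttc, c_max):
        # All the cores go to the application with the nearest deadline.
        alloc = [0.0] * len(ttc)
        alloc[ttc.index(min(ttc))] = c_max
        return alloc

    # With the values of Table 1: edf_all([50, 60, 70], 16) -> [16, 0, 0]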
Earliest Deadline First "Pure" (EDF_pure). This strategy uses priorities to allocate resources to applications. An application's priority is defined by its remaining time to complete: the shorter the time, the higher the priority. With this strategy, applications with low priorities may be paused. Again, the algorithm requires the maximum number of allocatable cores (c_max), as well as the time to complete (ttc_i) and the requested cores (rc_i) of each application. Table 1 shows that the application with the shortest time to complete acquires all the cores it asked for; the second application obtains all the remaining cores, which are fewer than the requested ones. The third application is not granted any resources, as they are depleted.
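EDF_pure can be sketched as follows (again our own transcription); with the values of Table 1 it reproduces the allocations shown there:

    def edf_pure(rc, ttc, c_max):
        # Grant the requests in deadline order until the cores run out.
        alloc = [0.0] * len(rc)
        for i in sorted(range(len(rc)), key=lambda i: ttc[i]):
            alloc[i] = min(rc[i], c_max)
            c_max -= alloc[i]
        return alloc

    # edf_pure([10, 8, 12], [50, 60, 70], 16) -> [10, 6, 0]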
Earliest Deadline First "Proportional" (EDF_prop). This strategy assigns a weight to each running application. The weight is related to its remaining time to complete ttc_i and calculated as

    w_i = 1 − (ttc_i − min(ttc) + 1) / Σ_{j∈I} (ttc_j − min(ttc) + 1)

where ttc is the vector composed of all the ttc_i and I is the set of applications running at that time (i ∈ I). If there is not a huge difference in terms of remaining times to complete, all the applications will acquire a portion of the available resources. The algorithm requires the requested number of cores of each application (rc_i) and the maximum number of allocatable cores (c_max), and produces the applications' weights (w_i). Requested cores are taken into account to avoid giving an application more resources than actually requested, even when its calculated weight is higher than that of other applications. Table 1 shows that all applications receive a certain amount of cores, and none of them is paused.
Proportional. This strategy represents the simplest way to allocate available cores. In this case, the weights are calculated as

    w_i = rc_i / Σ_{j∈I} rc_j

where I is the set of applications running at a given time, and i ∈ I. This solution creates a fair distribution of resources, since no application is preferred over the others. Table 1 shows that the weights, and thus the final allocation, are directly proportional to the amount of cores requested by the applications.
Speed. This strategy takes into account the applications' average nominal rates, that is, the number of input records each application can process per second per core. This value can be obtained, for example, by profiling the application, or can be inferred in other ways. An application's average nominal rate is computed as

    anr_i = Σ_{s∈S} (nomRate_s · w_s) / Σ_{s∈S} w_s

where S is the set of stages of the application, s ∈ S, nomRate_s is the nominal rate of stage s, and w_s is its weight (see Equation 2). An application's weight can then be computed as

    w_i = ANR / anr_i

where i ∈ I, the set of applications running at a given time, and ANR is the average of the anr_i. Table 1 shows that the three applications require 30 cores, but only 16 are available, and that ANR is equal to 5M, that is, the average of the three anr_i. Since applications A and B have the same nominal rate, they also have the same weight. C has a lower nominal rate — half the one of A and B — and its weight is therefore double that of A and B. As a result, C obtains half the cluster's cores, while A and B receive one quarter each.
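The three weight-based strategies are direct transcriptions of the formulas above; the sketch below (our own code) reproduces the weights of Table 1, up to rounding:

    def edf_prop_weights(ttc):
        shifted = [t - min(ttc) + 1 for t in ttc]
        return [1 - s / sum(shifted) for s in shifted]

    def proportional_weights(rc):
        return [r / sum(rc) for r in rc]

    def speed_weights(anr):
        ANR = sum(anr) / len(anr)  # average nominal rate across applications
        return [ANR / a for a in anr]

    # With ttc=[50, 60, 70], rc=[10, 8, 12], anr=[6e6, 6e6, 3e6] as in
    # Table 1: edf_prop_weights -> [0.97, 0.67, 0.36];
    # proportional_weights -> [0.33, 0.27, 0.40];
    # speed_weights -> [0.83, 0.83, 1.67].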
The evaluation presented in Section 7.2 highlights that there is no single strategy that outperforms all the others in all possible scenarios. The strategy to use will depend on the requirements one wants to meet. If the goal is to minimize deadline violations (i.e., delays), one of the EDF-based strategies with γ = 1 will be preferable. If the goal is to minimize errors (i.e., both missed and anticipated deadlines), and thus minimize resources, one should consider EDF_pure, EDF_prop with γ = 0, or Proportional with γ = 1. Finally, Speed might be the best choice if the applications are highly heterogeneous in terms of nominal rates.

Further strategies can be considered and easily implemented in xSpark, given its modular and hierarchical architecture.
7 EVALUATION
This section describes the experiments we conducted to evalu-
ate xSpark. All the experiments used Azure Standard_D14_v2
VMs with 16 CPUs, 112 GB of memory, and 800 GB of local
SSD storage. This kind of VM is optimized for memory
usage, with a high memory-to-core ratio. Each machine ran
Canonical Ubuntu Server 14.04.5-LTS, Oracle Java 8, Apache
Hadoop 2.7.2, Apache Spark 2.0.2 and xSpark. We dedicated
five VMs to HDFS and five to Apache Spark and xSpark. The
App | rc_i | ttc_i | EDF_all: ac_i | EDF_pure: ac_i | EDF_prop: w_i, ac_i | Proportional: w_i, ac_i | Speed: anr_i, w_i, ac_i
A   | 10   | 50    | 16            | 10             | 0.97, 7.75          | 0.33, 5.28              | 6M, 0.83, 4
B   | 8    | 60    | 0             | 6              | 0.67, 5.35          | 0.27, 4.32              | 6M, 0.83, 4
C   | 12   | 70    | 0             | 0              | 0.36, 2.90          | 0.40, 6.40              | 3M, 1.66, 8
Table 1: How the different strategies impact ¯ac in a simple example.
datasets were randomly generated with the goal of obtaining
homogeneous data; all executions were repeated five times
and we show average values.
7.1 Resource Allocation
To assess how xSpark allocates CPU cores dynamically we
used eight applications taken from two benchmark suites. aggr-by-key, aggr-by-key-int, group-by, sort-by-key, and sort-by-key-int stress basic aggregation and sorting capabilities and come from SparkPerf⁷. KMeans, SVM, and PageRank come from SparkBench⁸: the first two are machine learning applications, while the third is a well-known graph processing solution. The first five applications do not contain branches or loops, while the last three are iterative, but the number of iterations is configured at the beginning through parameters. This means that all the executions of each program have the same DAG (see Section 3).
We first used Spark to run each application and set a
baseline (testBase), that is, to know the shortest execution
time given that Spark was configured to use all the resources provided by the cluster: 64 cores in total, in our case. The datasets were randomly generated by the benchmark suites, using the application-specific parameters reported in Table 2.
App | Parameters
aggr-by-key | scaleFactor = 5, keys = 5000, tasks = 5000
aggr-by-key-int | scaleFactor = 5, keys = 5000, tasks = 5000
group-by | scaleFactor = 5, keys = 5000, tasks = 5000
sort-by-key | scaleFactor = 50, keys = 5000, tasks = 5000
sort-by-key-int | scaleFactor = 60, keys = 5000, tasks = 5000
KMeans | iterations = 1, partitions = 1000, dimensions = 20, numClusters = 10, numPoints = 100000000, scaling = 0.6
PageRank | iterations = 1, partitions = 1000, numVertices = 35000000
SVM | iterations = 1, partitions = 1000, numExamples = 150000000, features = 10
Table 2: Benchmarks configuration.
We then used xSpark configured as shown in Table 3 to have a fair comparison against Spark. We executed each application by imposing the same deadlines as the baseline executions (test0%), and by relaxing the original deadlines by 20% (test20%) and 40% (test40%), respectively. Since xSpark works as Spark, deadlines cannot be tighter than the baselines without adding resources. The goal of these experiments was thus to assess the precision with which xSpark can meet deadlines, and how it can optimize the allocation of cores.

7. https://github.com/databricks/spark-perf
8. https://github.com/SparkTC/spark-bench
Param | Value | Range | Description
γ | 0 | [0, 1] | Increment of execution speed (Eq. 21).
α | 1 | [0, 1] | Adherence to app. deadlines (Eq. 1).
ϕ | 1 | (0, 1] | Adherence to stage deadlines (Sec. 5.2).
β | 0.3 | [0, 1] | Divergence from profiling (Eq. 2).
cq | 0.05 | (0, ∞) | Quantization of core allocation (Eq. 7).
q | 0.5 sec. | — | Control period (Eq. 9).
Table 3: xSpark parameters used for the experiments.
To better explain how xSpark works, Figure 8 shows the behavior of a randomly-selected executor in charge of the nine stages of PageRank (test40%). The black and gray lines refer to the left-hand y-axis and show, respectively, the actual percentage of stage completion (a_%(k) in Equation 10) and the prescribed one (a°_%(k) in Equation 18), which is the set point (at each control step k) of the local controller. The blue line refers to the right-hand y-axis and shows the cores allocated to the executor. The E-labeled green vertical lines represent the actual stage ends, while the red dashed vertical lines represent stage deadlines as computed by the planner; the deadline for the last stage is also the deadline of the entire application.
[Figure: stage progress (%, left axis) and allocated cores (0–16, right axis) over a 160 s PageRank run, with stage-end markers E0–E7 and E13]
Figure 8: Example of how xSpark executors work.
This chart shows how xSpark fulfills stage deadlines with an error that is close to 0. At runtime, the local controllers foresee the allocation of core fractions to executors: when the actual progress of the stage is lower than the prescribed one, the allocated cores are increased; as soon as it becomes higher, they are quickly decremented. As already said, local controllers exploit a control period of 0.5 seconds and xSpark can thus be quite precise.
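The following is a minimal sketch of this per-executor control loop: every control period q the executor compares the prescribed progress a°%(k) with the actual one a%(k) and adjusts its core allocation. A plain proportional law with an assumed gain stands in for xSpark's actual controller (Equations 10 and 18 are not reproduced here), and the 16-core saturation mirrors the executor of Figure 8.

```python
# Sketch of the local control loop: a proportional correction applied
# every control period. KP is an assumed gain, not a value from the paper.
Q = 0.5          # control period [s], as in Table 3
KP = 0.2         # proportional gain: an assumption for illustration
MAX_CORES = 16.0

def control_step(prescribed_pct: float, actual_pct: float,
                 cores: float) -> float:
    """One control step: add cores when behind, release them when ahead."""
    error = prescribed_pct - actual_pct      # > 0: behind schedule
    cores = cores + KP * error               # proportional correction
    return min(max(cores, 0.0), MAX_CORES)   # saturate to what exists

# Example: 40% prescribed vs. 35% done -> the allocation grows.
print(control_step(40.0, 35.0, cores=4.0))   # -> 5.0
```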
Table 4 shows the precision and resource utilization with which xSpark carried out the different experiments.
Test                      ε_a      avg(ε_s)   all     util
aggr-by-key test0%       -0.90%    3.00%      60.41   94.40%
aggr-by-key test20%       0.78%    1.05%      37.63   58.80%
aggr-by-key test40%       0.26%    0.25%      29.15   45.55%
aggr-by-key-int test0%   -2.31%    4.81%      62.22   97.22%
aggr-by-key-int test20%   0.77%    0.85%      39.21   61.28%
aggr-by-key-int test40%   0.77%    0.52%      30.86   48.23%
group-by test0%          -1.73%    5.67%      61.57   96.21%
group-by test20%          0.28%    0.60%      42.95   67.12%
group-by test40%          0.31%    0.36%      34.59   54.05%
sort-by-key test0%       -1.31%    0.79%      59.19   92.50%
sort-by-key test20%       1.18%    0.41%      45.23   70.68%
sort-by-key test40%       0.44%    0.31%      35.44   55.37%
sort-by-key-int test0%   -2.48%    1.32%      59.19   92.50%
sort-by-key-int test20%   0.32%    0.21%      46.44   72.56%
sort-by-key-int test40%   0.42%    0.15%      38.29   59.84%
KMeans test0%            -0.92%    1.71%      54.43   85.05%
KMeans test20%            0.89%    0.94%      43.08   67.32%
KMeans test40%            0.58%    0.72%      36.15   56.49%
PageRank test0%          -1.20%    2.70%      58.77   91.83%
PageRank test20%          0.71%    1.03%      44.76   69.95%
PageRank test40%          0.61%    0.68%      37.93   59.27%
SVM test0%                0.71%    2.69%      52.62   82.23%
SVM test20%               1.49%    1.58%      41.62   65.03%
SVM test40%               1.91%    0.98%      34.39   53.74%

Table 4: Precision and performance of xSpark.
The error in fulfilling set deadlines is always less than 2.5% for complete applications (ε_a), while it can be slightly higher for single stages (avg(ε_s)). This is caused by the fact that stages can be very heterogeneous, but the errors compensate each other and the overall error is close to 0%.
xSpark slightly violated the deadline (the negative values of ε_a) only when the deadline was set to the execution time obtained on Spark (test0%), since we had to trade the fine-grained, dynamic allocation capabilities for a bit of performance. If the heuristic computes a stage deadline that is slightly longer than the fastest execution time, xSpark does not allocate all the resources, and the actual execution becomes a bit slower. These violations can easily be avoided by setting α to a reasonable value less than 1, which lets xSpark consider stricter, more conservative deadlines. This is like asking xSpark to work a bit faster and thus be able to meet deadlines even in case of errors or delays. Further experiments suggested that with α = 0.95 xSpark never violated set deadlines.
When deadlines are relaxed, xSpark meets them with an error that is always less than 2%. The gain in used resources is significant even when xSpark works with the baseline execution times. Table 4 shows the number of allocated cores per second, where all is the average value of allocated cores (i.e., all is the ratio between the integral of used cores over the whole execution and the execution duration), and util is the percentage of used resources (cores) with respect to the baseline (64 cores).
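The following is a small sketch of how these metrics can be computed from a sampled core-allocation trace: all is the time-average of allocated cores (the integral of the trace divided by the duration) and util its ratio to the 64-core baseline. The exact formula for the deadline error ε_a is an assumption, but it is consistent with the sign convention used in the text (negative means a violation).

```python
# Sketch of the Table 4 metrics, computed from a core trace sampled
# every `period` seconds. eps_a > 0 means early, < 0 means late.
def metrics(core_samples, period, deadline, duration, baseline=64):
    integral = sum(c * period for c in core_samples)  # core-seconds
    all_cores = integral / duration                   # avg allocated cores
    util = 100.0 * all_cores / baseline               # % of the baseline
    eps_a = 100.0 * (deadline - duration) / deadline  # assumed definition
    return all_cores, util, eps_a

# e.g., a 100 s run sampled every 0.5 s at a constant 38 cores,
# against a 102 s deadline:
print(metrics([38.0] * 200, 0.5, deadline=102.0, duration=100.0))
```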
These numbers help us compare the average resource allocation of xSpark with respect to Spark. Even if Spark provides some rudimentary mechanisms for dynamic resource allocation (see Section 8), in our experiments it always used all 64 cores. xSpark instead allocates resources according to deadlines, and even with the strictest response times (test0%), it was able to use between 2.78% and 17.77% fewer resources than Spark. This is due to the fact that xSpark can immediately release resources when they are not needed. In particular, the highest saving was with SVM (17.77%), since in some stages the available degree of parallelism was not fully exploited.
Column all shows a significant decrease in used resources when relaxing deadlines, but the experiments witness that there is no “easy” relationship between desired execution times and the appropriate amount of resources to achieve them: a manual, experience-based allocation would thus be tedious and quite imprecise.
7.1.1 Data Skew
Test                      ε_a      avg(ε_s)   std(ε_s)
group-by test20% (s = 1)  5.44%    2.86%      2.51%
group-by test20% (s = 2) -4.31%    2.83%      1.72%
group-by test20% (s = 3) -4.50%    2.46%      2.00%

Table 5: Precision and performance with data skew.
Data skew causes tasks (e.g., key-based operations) to have different durations [37]. Even if managing skewed data is out of the scope of this paper, this problem impacts and degrades both the performance of Spark (by around 20% according to [38]) and the control precision of xSpark. Since xSpark monitors the progress of stage execution (progress rate) by comparing the number of executed tasks against the total number of tasks to execute, significantly different task durations hamper this estimation.
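The sketch below makes the issue concrete: the task-count estimate that xSpark uses assumes tasks of similar length, and a duration-weighted estimate (hypothetical here, not part of xSpark) shows how far off the count-based one can be on a skewed stage.

```python
# Why task-count progress is fooled by skew: 98 finished short tasks look
# like 98% progress, while roughly half the actual work is still pending
# in two stragglers.
def progress_by_count(done: int, total: int) -> float:
    return 100.0 * done / total

def progress_by_work(done_durations, all_durations) -> float:
    return 100.0 * sum(done_durations) / sum(all_durations)

durations = [1.0] * 98 + [50.0, 50.0]     # 98 short tasks, 2 stragglers
done = durations[:98]                     # all short tasks finished
print(progress_by_count(98, 100))         # -> 98.0 (looks almost done)
print(progress_by_work(done, durations))  # -> ~49.5 (half the work left)
```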
Before thinking of improving our controllers (e.g., by considering solutions already proposed for MapReduce [39]), we wanted to evaluate how xSpark deals with skewed data. We ran application group-by on three different sets of skewed data, generated using Zipf's law [40]: given N values ordered by frequency, the frequency of the k-th value is equal to

f(k) = (1/k^s) / Σ_{n=1}^{N} (1/n^s)

where k ∈ [1, N] is the position of the value in the sequence and s ∈ [1, 3] identifies the degree of skewness of the data set (the higher, the more skewed). We chose application group-by since it does not pre-compute any intermediate result/reduction in parallel that may smooth the impact of skewness on the final computation; reduce-by-key, for example, would do that.
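A minimal generator of Zipf-skewed keys following the formula above is sketched below: p(k) ∝ 1/k^s, normalized over N keys. It is used here only to illustrate how such datasets could be produced; the benchmarks' own generators may differ in the details.

```python
# Draw keys from a Zipf distribution: p(k) = (1/k^s) / sum(1/n^s).
import random

def zipf_keys(n_keys: int, s: float, n_samples: int):
    weights = [1.0 / (k ** s) for k in range(1, n_keys + 1)]
    total = sum(weights)                      # the Σ 1/n^s normalizer
    probs = [w / total for w in weights]
    return random.choices(range(1, n_keys + 1), weights=probs, k=n_samples)

sample = zipf_keys(n_keys=5000, s=2, n_samples=10)
print(sample)  # most draws fall on the first few, most frequent keys
```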
Table 5 shows the results of our experiments. Again, we first ran the application on Spark to obtain the baseline deadline, and then performed test20% with xSpark. The table shows that xSpark can meet the deadline when s is equal to 1, with an error of 5.44%. When s increases to 2 and 3, xSpark slightly violates the deadline (ε_a is negative) with an error of 4.40% on average. As mentioned before, the error is higher than when executing applications not (or just slightly) affected by skewed data because of the aforementioned imprecisions of progress monitoring. Moreover, when the data are skewed the control of xSpark is less effective, since the duration of a stage is heavily impacted by its long tasks, which cannot be parallelized.
Figure 9 shows how skewed data impact the execution of a stage. The figure shows two different behaviors, imposed by skewed data, when processing the second stage of application group-by. The progress rate of the first executor (line A in the figure) moves very quickly to 100%, since it only receives short tasks that process low-frequency data. In contrast, the second executor (line B) shows a step-like progress rate. The executor is in charge of both short tasks, when the progress increases at a very fast rate (e.g., after 120 and 220 seconds), and three longer ones where progress is constant. In both cases, to complete the execution before the deadline (see the expected progress in light gray), a few cores are needed since the degree of parallelism is limited by data skew. In the first case, xSpark immediately releases allocated resources after completing the stage, while Spark would wait for a predefined amount of time to do it (see Section 8). In the second case, the execution lasts as foreseen, but it only uses a few cores. xSpark releases unused cores as soon as they are not needed anymore, while Spark cannot, since resources are associated with executors and are only released when they terminate. Note that since Spark uses the same executor to process different stages, one cannot allocate resources specifically for a particular stage, but must always consider possible worst cases.
[Figure 9 omitted: progress of two executors (lines A and B) processing the second stage of group-by on skewed data, against the expected progress.]
Figure 9: Execution with skewed data.
7.2 Concurrent Applications
To assess the problem of dynamic resource allocation with concurrent applications, we configured xSpark to use its Memory Manager, which was disabled in the previous experiments (since single applications should not have memory allocation problems). For these experiments, we used composite benchmarks, that is, sequences of applications separated by time delays. Applications were grouped according to their origins (SparkPerf and SparkBench). We also selected deadlines that would have been feasible if the applications were executed alone, but not necessarily satisfiable when running multiple applications on the same cluster.
The first experiment aimed to assess how xSpark handles the concurrent execution of applications. We used two configurations for xSpark: EDF_all with γ = 1, to maximize the resources allocated to the application with the earliest deadline; and EDF_pure with γ = 0, to ensure that the application with the earliest deadline is given enough resources to complete its execution on time. Each Spark application gets an independent set of executors, which only run tasks and store data for that application. The same happens with xSpark, but the main difference is that Spark, by default, runs all submitted applications in FIFO order, each consuming all available resources, unless the user manually configures the resources allocated to each Spark application statically at submission time. Instead, xSpark parallelizes the execution of the different applications and allocates the resources to both fulfill deadlines and minimize resource usage.
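The static, submission-time allocation mentioned above is what stock Spark offers; the sketch below shows it for a standalone cluster using Spark's own configuration properties (the values are illustrative).

```python
# "Static allocation at submission time" in stock Spark (standalone mode):
# the cap is fixed for the whole run, regardless of how many cores the
# application actually needs at each stage.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("statically-capped-app")
        .set("spark.cores.max", "16")        # hard cap for this app
        .set("spark.executor.cores", "4")    # cores per executor
        .set("spark.executor.memory", "8g")) # memory per executor

sc = SparkContext(conf=conf)
# ... job code ...
sc.stop()
```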
[Figure 10 omitted: timelines of the concurrent execution of PageRank, KMeans, and SVM under Spark, EDF_all (γ = 1), and EDF_pure (γ = 0), with submission waits and completion times in seconds.]
Figure 10: Concurrent execution of Benchmark 1.
Table 6 reports the results for the first benchmark, where ∆ indicates the amount of time we waited after the beginning of the experiment before submitting an application for execution. Both ∆ and appDeadline are defined in seconds, and appDeadline is expressed as a duration. Figure 10 shows how the concurrent execution was handled by each system.
Due to the FIFO scheduling of applications in Spark, and the fact that every application allocates all the resources in the cluster, the deadline requested for SVM cannot be satisfied. Moreover, as shown in Figure 10, the execution of the last application actually begins when its deadline is almost expired, since it needs to wait for the two previous applications to release their resources. Even if Spark were equipped with a non-FIFO application scheduler, one that always selects the application with the closest deadline, the situation would not change: when Spark can start KMeans, SVM has yet to be submitted for execution, and thus the only pending application is the one that is started. Instead, xSpark satisfies all three proposed deadlines in both configurations.
Configuration       App       ∆    appDeadline   ε_a
Spark               PageRank  0    300           80.3%
                    KMeans    40   300           65.0%
                    SVM       80   120           -18.8%
EDF_all (γ = 1)     PageRank  0    300           74.0%
                    KMeans    40   300           36.6%
                    SVM       80   120           32.5%
EDF_pure (γ = 0)    PageRank  0    300           3.6%
                    KMeans    40   300           5.3%
                    SVM       80   120           3.3%

Table 6: Benchmark 1.
Choosing strategy EDF_all with γ = 1 allows us to satisfy all the deadlines in the benchmark, as we can see in Figure 10. However, due to the nature of this strategy, we have high deadline errors: Table 6 shows that PageRank completes with a deadline error of 74.0%. In contrast, if we select EDF_pure and γ = 0 (again, as shown in Figure 10), we increase the execution time of each application with respect to the previous case, but we obtain a smaller deadline error: all applications terminate with a deadline error that is less than 5.3% (see Table 6).
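The sketch below contrasts the two strategies under stated assumptions: each running application exposes its deadline and the (estimated) cores it needs to finish on time; EDF_pure (γ = 0) gives the earliest-deadline application just what it needs, while EDF_all (γ = 1) also hands it every leftover core. This mirrors the description in the text, not xSpark's actual code, and the names and numbers in the example are illustrative.

```python
# Sketch of EDF_pure vs. EDF_all. apps: list of (name, deadline,
# needed_cores); returns a {name: cores} allocation.
def edf_allocate(apps, cluster_cores, gamma):
    alloc = {}
    spare = cluster_cores
    for name, _, needed in sorted(apps, key=lambda a: a[1]):  # EDF order
        cores = min(needed, spare)
        alloc[name] = cores
        spare -= cores
    if gamma == 1 and apps and spare > 0:
        earliest = min(apps, key=lambda a: a[1])[0]
        alloc[earliest] += spare   # EDF_all: earliest deadline takes all
    return alloc

apps = [("PageRank", 300, 16), ("KMeans", 340, 12), ("SVM", 200, 20)]
print(edf_allocate(apps, 64, gamma=0))  # EDF_pure: SVM gets 20 cores
print(edf_allocate(apps, 64, gamma=1))  # EDF_all: SVM also gets the spare 16
```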
This experiment also allowed us to assess the penalty introduced by our memory management based on off-heap memory. Recall that the management of off-heap memory is slower than that of on-heap memory, but if one only used on-heap memory and preallocated all the available memory, no new application would be executable because of the lack of memory. Our memory management must then be considered a trade-off between dynamic memory allocation and performance. Figure 10 shows that Spark completes the execution of the three applications in 222 seconds, while xSpark (EDF_all, γ = 1) completes after 230 seconds, a penalty of less than 4%. Even if this is not a thorough evaluation, given the limitations in managing the heap memory of JVMs, we advocate that xSpark provides a viable solution for memory management at runtime when the number of running applications cannot be estimated precisely beforehand.
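For reference, off-heap storage of the kind discussed above can be enabled in stock Spark with real configuration properties; the size below is illustrative. Off-heap memory is what allows resizing without restarting the JVM, at the access-speed cost quantified in the text.

```python
# Enabling Spark's off-heap memory pool (values illustrative).
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.memory.offHeap.enabled", "true")
        .set("spark.memory.offHeap.size", "4g"))  # off-heap pool size
```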
Benchmark               App              ∆    appDeadline   nomRate
Bench #2 (SparkBench)   PageRank         0    250           671K
                        KMeans           40   160           5142K
                        SVM              80   250           46K
Bench #3 (SparkPerf)    aggr-by-key      0    120           319K
                        group-by         40   200           486K
                        aggr-by-key-int  80   160           267K

Table 7: Benchmarks 2 and 3.
Strategy       γ   #A  #D  avg(|ε_a|)  ε_aA     ε_aD
EDF_all        0   3   0   12.4%       37.2%    0.0%
EDF_all        1   3   0   37.2%       111.7%   0.0%
EDF_pure       0   3   0   6.6%        19.8%    0.0%
EDF_pure       1   3   0   12.0%       36.1%    0.0%
EDF_prop       0   3   0   6.7%        20.1%    0.0%
EDF_prop       1   3   0   17.1%       51.2%    0.0%
Proportional   0   2   1   6.8%        7.5%     -13.0%
Proportional   1   3   0   13.7%       41.2%    0.0%
Speed          0   3   0   5.5%        16.5%    0.0%
Speed          1   3   0   18.1%       54.4%    0.0%

Table 8: Results for Benchmark 2.
We also ran two additional experiments to evaluate the five strategies supported by xSpark with γ equal to both 0 and 1 (we decided to focus on the two extreme values, but any value in-between would be applicable), for a total of 10 options. Table 7 shows the configurations of the two additional benchmarks. Again, the tested deadlines are feasible when the applications execute alone, but cannot be satisfied when running multiple applications together on the same cluster.
Table 8 shows the results for the first of the two additional benchmarks (Bench #2), where #A indicates the number of applications that completed their execution before the deadline, while #D is the number of applications that completed with a delay. avg(|ε_a|) is the average of the absolute value of ε_a over the three applications, while ε_aA is the sum of the errors of the applications that finished in advance and ε_aD is the sum of the errors of those that completed with a delay. The table shows that most of the configurations were able to complete all executions before the set deadlines: the value in column #A is 3 in 9 out of 10 cases.
The smallest deadline errors, and no violations, in this experiment were achieved with strategies EDF_pure (avg(|ε_a|) = 6.6%) and Speed (avg(|ε_a|) = 5.5%), both with γ = 0. Strategy Proportional (with γ = 0) also shows a small avg(|ε_a|), but one application out of three ended with a delay (#D = 1). This might not be a problem if we want to ensure a fair distribution of resources across the applications, at the price of (slightly) violating some of the deadlines. Furthermore, strategy Proportional with γ = 1 increased the number of applications that end in advance. Choosing γ = 1 is not always the best choice; it is simply a way to speed up the computation if future contention is expected. As a result, if we consider EDF_all, we move from avg(|ε_a|) = 12.4% with γ = 0 to avg(|ε_a|) = 37.2% with γ = 1, which is about three times greater.
Table 9 shows the results for the second additional benchmark (Bench #3). The workload of the three applications is too high for the allocated resources, resulting in at least one violation with any strategy. Since we were not sure we could eliminate the delays, we decided to examine how Spark would behave if it had an EDF-like task scheduler for its applications. In particular, we tried to give all the resources to a single application, since we wanted to mimic the behavior of EDF_all with γ = 1, but with the advantage of knowing the workload a-priori. We called this approach Clairvoyance EDF: we used the logs produced by Spark to know exactly all the applications to execute, along with the duration of all their tasks. Our analyses with Clairvoyance EDF showed that only two out of three applications can complete their executions by the designated deadlines (#A = 2 and #D = 1). This is the same result obtained by EDF_all (without knowing the applications to execute and their duration in advance).
If this workload is run in a situation in which we have a “strict” deadline, we need to minimize the number of violations. As a result, we need to choose the strategy with the smallest #D (#D = 1). This ends up being either EDF_all or EDF_pure with γ = 1. On the other hand, if the deadline is considered to be “soft”, we may want to minimize the value of ε_aD by choosing EDF_all with γ = 1 (ε_aD = 16.6%). If paying for more resources is as costly as violating the deadline (or even preferred), we may want to minimize the average deadline error avg(|ε_a|). In this composite benchmark the best choice would be to use either EDF_pure or EDF_prop with γ = 0.
These experiments show that there is no single strategy that is always better than the others. To generalize, EDF_all with γ = 0 is the best strategy when it comes to avoiding deadline violations, while EDF_pure with γ = 0 is preferable when not violating deadlines is as important as saving resources. Additional, custom strategies could also be conceived and added to xSpark without modifying its local controllers and heuristic-based planners.
Strategy       γ   #A  #D  avg(|ε_a|)  ε_aA    ε_aD
EDF_all        0   2   1   20.2%       39.5%   21.2%
EDF_all        1   2   1   24.4%       56.7%   16.6%
EDF_pure       0   1   2   10.8%       5.8%    26.7%
EDF_pure       1   2   1   11.5%       12.1%   22.5%
EDF_prop       0   1   2   11.0%       5.8%    27.3%
EDF_prop       1   1   2   10.8%       6.1%    26.4%
Proportional   0   1   2   12.6%       3.8%    34.0%
Proportional   1   1   2   11.7%       5.0%    30.2%
Speed          0   1   2   13.6%       5.5%    35.5%
Speed          1   1   2   11.5%       5.8%    28.6%

Table 9: Results for Benchmark 3.

7.3 Threats to Validity
The experiments were conducted using eight different applications, and the datasets were generated randomly using well-known benchmarks. Even if xSpark improves Spark with respect to different metrics, we must highlight the threats that may hamper the validity of our experiments [42]:
Internal Threats. Each application was profiled using Spark and then executed using xSpark on the same dataset. If the profiling datasets are (significantly) different, in terms of data distribution, from the one used to execute the applications, the precision of the heuristic in computing stage deadlines decreases, but the local controllers can still cope with these imprecisions. We conducted some additional experiments (group-by test20%) to assess that xSpark is still able to effectively allocate resources dynamically in this scenario. In particular, we obtained an average error (ε_a) of -3.30% when controlling a dataset generated with s = 0.5 but using the profiling of a uniform dataset. On the contrary, when executing an application with a uniform dataset controlled with a profiling of skewed data (s = 0.5), we obtained an average error of 0.86%.
More experiments are needed to correlate deadlines and the optimal amount of resources to fulfill them. However, this was not the goal of our work, since we allocate resources at runtime with our fast and fine-grained control. Not only would a static (and manual) approach have to find the optimal allocation, using a comprehensive model or past experience, but without dynamic allocation it might be impossible to optimize resources, since resource demand is not constant. This is even truer when dealing with multiple concurrent applications (whose scheduling is often not under the control of the system administrator), which could contend for resources at any time during their execution.
External Threats. Some of our assumptions may limit the generalization of the experiments/solution. Even if Spark was conceived to exploit in-memory processing, it relies on a storage layer (usually HDFS), which may act as a bottleneck. The first assumption is that HDFS must be sized properly: in real-world scenarios this is not always the case; to avoid the problem, we decided to dedicate to HDFS the same number of VMs as those used to run the executors. Moreover, some of the stages of our test applications were conceived to carry out a high number of parallel read/write operations towards the storage layer to stress its performance, and we observed no significant delays.
The second assumption is that the available memory is sufficient. The more data are processed in parallel, that is, the more cores a VM offers, the more data must be loaded in memory. This is a general requirement of any big-data solution and we see it as part of configuring the VMs properly, rather than a possible threat to the generalization of our experiments.
The last assumption refers to the type of Spark applications we considered. xSpark takes into account the core API and the graph and machine learning libraries of Spark. We also conducted some preliminary experiments with another well-known benchmark suite, TPC-H (http://www.tpc.org/tpch/), that exploits Spark SQL to implement business-oriented queries. The results are similar to those presented and show that xSpark can successfully control this type of application, both individually and in parallel. In contrast, xSpark cannot deal with stream-based applications: their processing model is different and one should think of special-purpose qualities of service, rather than setting a deadline for completing a given execution.
Construct and Conclusion Threats. The experiments demonstrate the validity of our claims, i.e., that the fine-grained and fast resource allocation capabilities provided by xSpark can provision resources precisely and efficiently to multiple applications and have them meet set deadlines. The obtained results are statistically robust and only show a small variance (see Table 4). We also used skewed data and obtained similar results.
8 RELATED WORK
The work presented in this paper must be compared with the results obtained in different research areas.
First of all, Ousterhout et al. [38] provide a comprehensive analysis of the performance of various tools for data analytics. As for Spark, they show that CPU allocation is often the bottleneck. They quantify that network optimization reduces execution time by 2%, while optimizing disk usage yields a gain of 19%. They also identify the Java garbage collector and I/O transfers as significant speed limitations for big-data applications.
Spark itself provides limited capabilities to dynamically adjust the resources (executors) allocated to applications. The External Shuffle Service allows Spark to save resources by switching off the executors that remain idle for a user-defined amount of time; executors are then requested again if pending tasks have been waiting for too long. This dynamism, however, is limited to the executors preallocated to an application (Spark cannot borrow additional executors from the system), and works at the executor level, that is, it cannot manage individual CPU cores.
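This executor-level dynamism is driven by real Spark configuration properties, sketched below with illustrative values: idle executors are released after a timeout and new ones are requested when tasks stay pending, but the granularity is always a whole executor, never individual cores.

```python
# Stock Spark dynamic allocation (executor granularity only).
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
        .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s"))
```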
As for additional resource management solutions, Spark can be paired with external resource managers, such as Mesos [43] and YARN [44]. Mesos sends resource offers (push-based scheduling) to its clients and manages both CPU cores and memory, while YARN waits for resource requests (pull-based scheduling) and only considers memory. These systems do not support any application-specific policy for resource management; one could think of using our controllers on top of them for this purpose. They both support containers to launch executors, but they do not offer any form of vertical scalability. Mesos also offers an optional fine-grained mode, where each task is containerized, but the runtime overhead is heavy: this is why the use of fine-grained resource allocation is deprecated in Spark 2.0.
Lakew et al. [46] and Barna et al. [47] use containers as a means to allocate resources dynamically. Similarly to our past work [3], Lakew et al. [46] exploit containers, model-predictive control, and system identification to support vertical elasticity [1] and target different KPIs and resource dimensions. Barna et al. [47] target autonomic, containerized multi-tier applications. They exploit layered queuing networks to create self-tuning controllers for applications composed of heterogeneous components such as web services, databases, and big-data elements. These solutions could be used to manage the resources allocated to a complete Spark instantiation, but they cannot manage the resources allocated to the different applications, since they have no visibility of them. xSpark can do that since, besides working on dynamic resource management, we have also changed the architecture and processing model behind Spark to work at a lower granularity level.
Moving to approaches that complement the execution of big-data applications with deadlines, we can only mention a few works. AROMA [15] is a deadline-aware tool for resource inference and allocation of MapReduce applications on the cloud. It uses a two-phase machine learning and optimization framework: adaptation matches resource utilization with previously executed jobs and makes provisioning decisions accordingly. Cura [49] is another tool, based on a new MapReduce cloud model for data analytics. It leverages MapReduce profiling to automatically create the best cluster configuration and to optimize global resource consumption. Malekimajd et al. [50] provide upper and lower bounds for MapReduce job execution times in Hadoop based on a linear programming model.
As for Spark, Gibilisco et al. [18] propose a performance model for DAG-based applications that allows one to accurately predict how an application will behave given a specific data size and certain configuration settings; they do not provide dynamic resource management. Marconi et al. [28] describe a formal model to verify the feasibility of set deadlines given the DAG of the application, the input dataset, and the available resources. Sidhanta et al. [17] introduce OptEx, an optimization model to configure a Spark cluster according to time and cost constraints. Their approach is static: it only supports VM-based clusters and does not consider data skew.
Islam et al. [16] propose another static resource allocation system called dSpark. It solves an optimization problem to compute different possible resource allocation schemas, with different costs and resource requirements; the user then selects the one s/he prefers. dSpark cannot manage multiple applications, runtime contention, or data skew, and the allocation is limited to VMs. Even if these works have some limitations, they could be complementary to our solution: currently, we use our fast heuristic to preallocate resources, but more sophisticated solutions could be adopted.
The same complementarity applies to the works that monitor the execution of big-data applications [39], [51]-[54]. For example, Morton et al. [39] propose ParaTimer, a progress indicator for MapReduce DAGs. They estimate the progress of complex MapReduce applications that are translated into a DAG of jobs, using a critical-path algorithm to find the longest sequence of tasks to be processed. They also handle data skew and failures by providing a set of estimations that consider different scenarios.
Different approaches exist for scheduling multiple big-data applications (mostly MapReduce applications) on a shared cluster. Kc et al. [55] present an approach that schedules Hadoop jobs to meet QoS deadlines. They model the execution cost of each task, taking into account both processing and data transfer times, and then estimate the number of Map and Reduce jobs required to satisfy the deadline. Polo et al. [56] also provide a solution for the performance-driven co-scheduling of the tasks of diverse MapReduce applications. They introduce a new task scheduler that can dynamically estimate the cost of parallel task executions and reallocate resources without distinguishing between Map and Reduce jobs. The same authors [57] propose another scheduling technique that dynamically builds performance models of the workloads and uses them to inform the adaptive scheduler when numerous applications are competing for shared resources. While these solutions address the problem of scheduling applications given the executors (and thus the resources reserved to them), xSpark considers resources first and then computes a feasible scheduling given the policy adopted for allocating resources at runtime.
9 CONCLUSIONS AND FUTURE WORK
The paper proposes a solution for enriching Spark with fine-grained dynamic resource allocation, and presents xSpark as a supporting prototype infrastructure. The proposed solution considers CPU cores and memory and allows for their optimized allocation when the framework is used to execute both single and multiple applications. The assessment we conducted demonstrates both a better utilization of resources and a reduced number of violated deadlines.
As for future work, we would like to extend our approach to consider disk and network usage. We will also keep studying how to combine containers and control theory to manage infrastructures that host heterogeneous applications (e.g., web services and big-data frameworks).
ACKNOWLEDGMENTS
This work was supported by the project EEB - Edifici A Zero Consumo Energetico In Distretti Urbani Intelligenti (Italian Technology Cluster For Smart Communities) - CTN01_00034_594053 and by the GAUSS national research project (MIUR, PRIN 2015, Contract 2015KWREMX).
REFERENCES
[1] S. Dustdar, Y. Guo, B. Satzger, and H.-L. Truong, “Principles of Elastic Processes,” IEEE Internet Computing, vol. 15, 2011.
[2] P. Padala, K. G. Shin, X. Zhu, M. Uysal, Z. Wang, S. Singhal, A. Merchant, and K. Salem, “Adaptive Control of Virtualized Resources in Utility Computing Environments,” in Proc. of the 2nd ACM European Conference on Computer Systems. ACM, 2007.
[3] L. Baresi, S. Guinea, A. Leva, and G. Quattrocchi, “A Discrete-time Feedback Controller for Containerized Cloud Applications,” in Proc. of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2016.
[4] N. Roy et al., “Efficient Autoscaling in the Cloud Using Predictive Models for Workload Forecasting,” in Proc. of the 4th IEEE International Conference on Cloud Computing. IEEE, 2011.
[5] D. Ardagna, B. Panicucci, and M. Passacantando, “A Game Theoretic Formulation of The Service Provisioning Problem in Cloud Systems,” in Proc. of the 20th International Conference on World Wide Web. ACM, 2011.
[6] C. Klein et al., “Brownout: Building More Robust Cloud Applications,” in Proc. of the 36th International Conference on Software Engineering. ACM, 2014.
[7] D. A. Menascé, “QoS Issues in Web Services,” IEEE Internet Computing, vol. 6, 2002.
[8] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, “Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing As the 5th Utility,” Future Generation Computing Systems, vol. 25, 2009.
[9] “Apache Hadoop,” http://hadoop.apache.org, 2017.
[10] “IBM InfoSphere,” https://www-01.ibm.com/software/data/infosphere/, 2017.
[11] M. Isard et al., “Dryad: Distributed Data-parallel Programs from Sequential Building Blocks,” in Proc. of the 2nd ACM European Conference on Computer Systems, 2007.
[12] A. Verma, L. Cherkasova, V. S. Kumar, and R. H. Campbell, “Deadline-based Workload Management for MapReduce Environments: Pieces of the Performance Puzzle,” in Proc. of the 22nd IEEE Network Operations and Management Symposium. IEEE, 2012.
[13] “Apache Spark,” http://spark.apache.org, 2017.
[14] A. Verma, L. Cherkasova, and R. H. Campbell, “ARIA: Automatic Resource Inference and Allocation for MapReduce Environments,” in Proc. of the 8th ACM International Conference on Autonomic Computing. ACM, 2011.
[15] P. Lama et al., “AROMA: Automated Resource Allocation and Configuration of MapReduce Environment in the Cloud,” in Proc. of the 9th International Conference on Autonomic Computing. ACM, 2012.
[16] M. Islam et al., “dSpark: Deadline-based Resource Allocation for Big Data Applications in Apache Spark,” in Proc. of the 13th IEEE International Conference on eScience, 2017.
[17] S. Sidhanta, W. Golab, and S. Mukhopadhyay, “OptEx: A Deadline-Aware Cost Optimization Model for Spark,” in Proc. of the 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2016.
[18] G. P. Gibilisco et al., “Stage Aware Performance Modeling of DAG Based in Memory Analytic Platforms,” in Proc. of the 9th IEEE International Conference on Cloud Computing. IEEE, 2016.
[19] “Docker,” http://docker.com, 2017.
[20] D. Merkel, “Docker: Lightweight Linux Containers for Consistent Development and Deployment,” Linux Journal, 2014.
[21] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An Updated Performance Comparison of Virtual Machines and Linux Containers,” in Proc. of the 16th IEEE International Symposium on Performance Analysis of Systems and Software, 2015.
[22] S. Soltesz, H. Pötzl, M. E. Fiuczynski, A. Bavier, and L. Peterson, “Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors,” in Proc. of the 2nd ACM European Conference on Computer Systems, vol. 41. ACM, 2007.
[23] “HDFS Users Guide,” https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html, 2017.
[24] “Linux Manual: cgroups,” http://man7.org/linux/man-pages/man7/cgroups.7.html, 2016.
[25] M. Chen, S. Mao, and Y. Liu, “Big Data: A Survey,” Mobile Networks and Applications, vol. 19, 2014.
[26] N. Kozlowski, “Spark Memory Management Part 1: Push It to the Limits,” https://www.pgs-soft.com/spark-memory-management-part-1-push-it-to-the-limits, 2017.
[27] E. Gianniti, A. M. Rizzi, E. Barbierato, M. Gribaudo, and D. Ardagna, “Fluid Petri Nets for the Performance Evaluation of MapReduce and Spark Applications,” SIGMETRICS Performance Evaluation Review, vol. 44, 2017.
[28] F. Marconi, G. Quattrocchi, L. Baresi, M. Bersani, and M. Rossi, “On the Timed Analysis of Big-Data Applications,” in Proc. of the 10th NASA Formal Methods Symposium. Springer, 2018.
[29] K. Narendra et al., “An Iterative Method for the Identification of Nonlinear Systems Using a Hammerstein Model,” IEEE Transactions on Automatic Control, vol. 11, 1966.
[30] A. Nordsjo and L. Zetterberg, “Identification of Certain Time-varying Nonlinear Wiener and Hammerstein Systems,” IEEE Transactions on Signal Processing, vol. 49, 2001.
[31] J. Voros, “Identification of Hammerstein Systems with Time-varying Piecewise-linear Characteristics,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 52, 2005.
[32] D. Guarin and R. Kearney, “An Instrumental Variable Approach for the Identification of Time-varying, Hammerstein Systems,” IFAC-PapersOnLine, vol. 48, 2015.
[33] K. S. Hong et al., “On-line Scheduling of Real-time Tasks,” in Proc. of the 9th Real-Time Systems Symposium. IEEE, 1988.
[34] M. L. Dertouzos et al., “Multiprocessor Online Scheduling of Hard-real-time Tasks,” IEEE Transactions on Software Engineering, vol. 15, 1989.
[35] N. W. Fisher, The Multiprocessor Real-time Scheduling of General Task Systems. The University of North Carolina at Chapel Hill, 2007.
[36] J. A. Stankovic, M. Spuri, K. Ramamritham, and G. C. Buttazzo, Deadline Scheduling for Real-time Systems: EDF and Related Algorithms. Springer Science & Business Media, 2012, vol. 460.
[37] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, “A Study of Skew in MapReduce Applications,” Open Cirrus Summit, vol. 11, 2011.
[38] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B.-G. Chun, “Making Sense of Performance in Data Analytics Frameworks,” in Proc. of the 12th USENIX Conference on Networked Systems Design and Implementation. USENIX, 2015.
[39] K. Morton, M. Balazinska, and D. Grossman, “ParaTimer: A Progress Indicator for MapReduce DAGs,” in Proc. of the 36th ACM International Conference on Management of Data. ACM, 2010.
[40] J. Lin, “The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce,” in Proc. of the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, 2009.
[41] K. G. Shin and P. Ramanathan, “Real-time Computing: A New Discipline of Computer Science and Engineering,” Proceedings of the IEEE, vol. 82, 1994.
[42] C. Wohlin et al., “Empirical Research Methods in Web and Software Engineering,” Web Engineering, 2006.
[43] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, “Mesos: A Platform for Fine-grained Resource Sharing in the Data Center,” in Proc. of the 8th USENIX Conference on Networked Systems Design and Implementation. USENIX, 2011.
[44] V. K. Vavilapalli et al., “Apache Hadoop YARN: Yet Another Resource Negotiator,” in Proc. of the 4th Annual Symposium on Cloud Computing. ACM, 2013.
[45] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, “Dominant Resource Fairness: Fair Allocation of Multiple Resource Types,” in Proc. of the 8th USENIX Conference on Networked Systems Design and Implementation, 2011.
[46] E. B. Lakew et al., “KPI-agnostic Control for Fine-grained Vertical Elasticity,” in Proc. of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE, 2017.
[47] C. Barna, H. Khazaei, M. Fokaefs, and M. Litoiu, “Delivering Elastic Containerized Cloud Applications to Enable DevOps,” in Proc. of the 12th International Symposium on Software Engineering for Adaptive and Self-Managing Systems. IEEE, 2017.
[48] L. Baresi, S. Guinea, G. Quattrocchi, and D. A. Tamburri, “MicroCloud: A Container-Based Solution for Efficient Resource Management in the Cloud,” in Proc. of the 2016 IEEE International Conference on Smart Cloud (SmartCloud). IEEE, 2016.
[49] B. Palanisamy et al., “Cura: A Cost-Optimized Model for MapReduce in a Cloud,” in Proc. of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013.
[50] M. Malekimajd et al., “Optimal MapReduce Job Capacity Allocation in Cloud Systems,” SIGMETRICS Performance Evaluation Review, vol. 42, 2015.
[51] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica, “BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data,” in Proc. of the 8th ACM European Conference on Computer Systems. ACM, 2013.
[52] S. Chaudhuri et al., “When Can We Trust Progress Estimators for SQL Queries?” in Proc. of the 31st ACM International Conference on Management of Data. ACM, 2005.
[53] S. Chaudhuri, V. Narasayya, and R. Ramamurthy, “Estimating Progress of Execution for SQL Queries,” in Proc. of the 30th ACM International Conference on Management of Data. ACM, 2004.
[54] K. Lee et al., “Operator and Query Progress Estimation in Microsoft SQL Server Live Query Statistics,” in Proc. of the 42nd ACM International Conference on Management of Data. ACM, 2016.
[55] K. Kc and K. Anyanwu, “Scheduling Hadoop Jobs to Meet Deadlines,” in Proc. of the 2nd IEEE International Conference on Cloud Computing Technology and Science. IEEE, 2010.
[56] J. Polo et al., “Performance-driven Task Co-scheduling for MapReduce Environments,” in Proc. of the 20th IEEE Network Operations and Management Symposium. IEEE, 2010.
[57] J. Polo, Y. Becerra, D. Carrera, M. Steinder, I. Whalley, J. Torres, and E. Ayguade, “Deadline-Based MapReduce Workload Management,” IEEE Transactions on Network and Service Management, vol. 10, 2013.
Luciano Baresi is a full professor at the Politecnico di Milano. Luciano was visiting professor at the University of Oregon (USA) and visiting researcher at the University of Paderborn (Germany). His research interests are in the broad area of software engineering and include formal approaches for modeling and specification languages, distributed systems, service-based applications, and mobile, self-adaptive, and pervasive software systems.

Sam Guinea is an assistant professor in Software Engineering at Politecnico di Milano. His research interests include the runtime monitoring and adaptation of service- and cloud-based systems. He was the Scientific Director of the Master in Cloud Computing and Agile Methodologies co-organized by Cefriel and Politecnico di Milano in 2015. He is the Project Manager of the Polisocial Award project “LYV - Lend Your Voice”.

Alberto Leva is an associate professor of Automatic Control at Politecnico di Milano. His main research interests concern methods and tools for the automatic tuning of industrial controllers and control structures, process modelling, simulation, and control. In recent years, he has been concentrating on control and control-based design of computing systems, addressing problems like scheduling and resource allocation.

Giovanni Quattrocchi received his Ph.D. in Computer Engineering in 2018 from Politecnico di Milano, where he is currently a post-doc researcher. He was a visiting researcher at University of California San Diego and Imperial College London. His research interests include self-adaptive systems, software architectures, distributed systems, performance analysis, and mobile and edge computing.
... Many works use this feature to implement automated control systems for DPSs that monitor the use of resources and adapt the deployment to meet the quality of service specified by the users, while using the minimum amount of resources. The interested reader can refer to recent work on dynamic adaptation for batch [19] and stream [28] DPSs. ...
... The flexibility of the model allows for complex workflows, including streaming computations with nested iterations, which are hard or even impossible to express in other systems. Timely dataflow is currently implemented as a Rust library 19 : as in Kafka Streams, developers write a program that defines the graph of computation using the library API, and run multiple instances of the program, each of them representing a worker in our model. At runtime, the program instantiates the concrete tasks, which communicate with each other either using shared memory (within one worker) or TCP channels (across workers). ...
Article
Full-text available
Data is a precious resource in today’s society, and is generated at an unprecedented and constantly growing pace. The need to store, analyze, and make data promptly available to a multitude of users introduces formidable challenges in modern software platforms. These challenges radically impacted the research fields that gravitate around data management and processing, with the introduction of distributed data-intensive systems that offer innovative programming models and implementation strategies to handle data characteristics such as its volume, the rate at which it is produced, its heterogeneity, and its distribution. Each data-intensive system brings its specific choices in terms of data model, usage assumptions, synchronization, processing strategy, deployment, guarantees in terms of consistency, fault tolerance, ordering. Yet, the problems data-intensive systems face and the solutions they propose are frequently overlapping. This paper proposes a unifying model that dissects the core functionalities of data-intensive systems, and discusses alternative design and implementation strategies, pointing out their assumptions and implications. The model offers a common ground to understand and compare highly heterogeneous solutions, with the potential of fostering cross-fertilization across research communities. We apply our model by classifying tens of systems: an exercise that brings to interesting observations on the current trends in the domain of data-intensive systems and suggests open research directions.
... dynaSpark [3,2] extends Spark by introducing advanced and automated resource management. dynaSpark allows users to define deadlines that are considered as the desired response time for a single batch execution (e.g., a training process). ...
... Evaluation The evaluation of dynaSpark was executed on Microsoft Azure using a cluster of 5 Standard_D14_v2 virtual machines (VMs) equipped with 16 CPUs, 112 GB of memory, and 800 GB of local SSD storage. Here we only report a single significant experiment that clearly demonstrates the benefits of dynaSpark , while [3] comprises a larger and comprehensive set of experiments. ...
Chapter
Full-text available
In recent years, Web services are becoming more and more intelligent (e.g., in understanding user preferences) thanks to the integration of components that rely on Machine Learning (ML). Before users can interact (inference phase) with an ML-based service (ML-Service), the underlying ML model must learn (training phase) from existing data, a process that requires long-lasting batch computations. The management of these two, diverse phases is complex and meeting time and quality requirements can hardly be done with manual approaches.This paper highlights some of the major issues in managing ML-services in both training and inference modes and presents some initial solutions that are able to meet set requirements with minimum user inputs. A preliminary evaluation demonstrates that our solutions allow these systems to become more efficient and predictable with respect to their response time and accuracy. KeywordsMachine learningRuntime managementService orchestration
... dynaSpark [3,2] extends Spark by introducing advanced and automated resource management. dynaSpark allows users to define deadlines that are considered as the desired response time for a single batch execution (e.g., a training process). ...
... Evaluation The evaluation of dynaSpark was executed on Microsoft Azure using a cluster of 5 Standard_D14_v2 virtual machines (VMs) equipped with 16 CPUs, 112 GB of memory, and 800 GB of local SSD storage. Here we only report a single significant experiment that clearly demonstrates the benefits of dynaSpark , while [3] comprises a larger and comprehensive set of experiments. ...
Preprint
In recent years, Web services are becoming more and more intelligent (e.g., in understanding user preferences) thanks to the integration of components that rely on Machine Learning (ML). Before users can interact (inference phase) with an ML-based service (ML-Service), the underlying ML model must learn (training phase) from existing data, a process that requires long-lasting batch computations. The management of these two, diverse phases is complex and meeting time and quality requirements can hardly be done with manual approaches. This paper highlights some of the major issues in managing ML-services in both training and inference modes and presents some initial solutions that are able to meet set requirements with minimum user inputs. A preliminary evaluation demonstrates that our solutions allow these systems to become more efficient and predictable with respect to their response time and accuracy.
... Many works use this feature to implement automated control systems for DPSs that monitor the use of resources and adapt the deployment to meet the quality of service specified by the users, while using the minimum amount of resources. The interested reader can refer to recent work on dynamic adaptation for batch [19] and stream [28] DPSs. ...
... The flexibility of the model allows for complex workflows, including streaming computations with nested iterations, which are hard or even impossible to express in other systems. Timely dataflow is currently implemented as a Rust library 19 : as in Kafka Streams, developers write a program that defines the graph of computation using the library API, and run multiple instances of the program, each of them representing a worker in our model. At runtime, the program instantiates the concrete tasks, which communicate with each other either using shared memory (within one worker) or TCP channels (across workers). ...
Preprint
Full-text available
Data is a precious resource in today's society, and is generated at an unprecedented and constantly growing pace. The need to store, analyze, and make data promptly available to a multitude of users introduces formidable challenges in modern software platforms. These challenges radically transformed all research fields that gravitate around data management and processing, with the introduction of distributed data-intensive systems that offer new programming models and implementation strategies to handle data characteristics such as volume, velocity, heterogeneity, and distribution. Each data-intensive system brings its specific choices in terms of data model, usage assumptions, synchronization, processing strategy, deployment, guarantees in terms of consistency, fault tolerance, ordering. Yet, the problems data-intensive systems face and the solutions they propose are frequently overlapping. This paper proposes a unifying model that dissects the core functionalities of data-intensive systems, and precisely discusses alternative design and implementation strategies, pointing out their assumptions and implications. The model offers a common ground to understand and compare highly heterogeneous solutions, with the potential of fostering cross-fertilization across research communities and advancing the field. We apply our model by classifying tens of systems and this exercise guides interesting observations on the current state of things and on open research directions.
... For example, Gandiva [29] minimizes the response time of deep learning training activities by using a scheduling algorithm that prioritizes the use of computational resources for different concurrent jobs. Baresi et al. [2] exploit control theory to optimize the resource allocation of Spark applications while guaranteeing requirements on the response time. AROMA [12] adopts combinatorial optimization and heuristics to efficiently provision resources for deadline constrained long-lasting jobs of MapReduce applications (including ML training). ...
... Their work was tested on a cloud setup but authors plan to extend it to FedML in the future. Sun et al. [22] and our previous work [2] focus on the dynamic resource allocation for ML jobs. Both the approaches exploit containers to improve the management of resources. ...
... Several approaches in the literature focus on the resource management of ML training [3,12], while the inference phase calls for new studies and approaches. Existing solutions applied to interactive web applications [2,7] cannot be reused since they do not consider the heterogeneity introduced by GPUs but only different types of virtual machines. ...
Chapter
Full-text available
TensorFlow, a popular machine learning (ML) platform, allows users to transparently exploit both GPUs and CPUs to run their applications. Since GPUs are optimized for compute-intensive workloads (e.g., matrix calculus), they help boost executions, but introduce resource heterogeneity. TensorFlow neither provides efficient heterogeneous resource management nor allows for the enforcement of user-defined constraints on the execution time. Most of the works address these issues in the context of creating models on existing data sets (training phase), and only focus on scheduling algorithms. This paper focuses on the inference phase, that is, on the application of created models to predict the outcome on new data interactively, and presents a comprehensive resource management solution called ROMA (Resource Constrained ML Applications). ROMA is an extension of TensorFlow that (a) provides means to easily deploy multiple TensorFlow models in containers using Kubernetes b) allows users to set constraints on response times, (c) schedules the execution of requests on GPUs and CPUs using heuristics, and (d) dynamically refines the CPU core allocation by exploiting control theory. The assessment conducted on four real-world benchmark applications compares ROMA against four different systems and demonstrates a significant reduction ( \(75\%\)) in constraint violations and \(24\%\) saved resources on average.
... Lee et al. [25] extended OptEx by including additional parameters to incorporate the probability of failure for an application and used this model to drive resource allocation. Baresi et al. [26] proposed a containerized modification of Spark that exploits a black box control theoretic model to realize per-stage deadlines at runtime. Unlike our approach, their approach relies on modifying the default Spark system, which many not be feasible in production environments. ...
Article
Full-text available
A data analytics application submitted to a Spark cluster often has to finish executing by a specified deadline. To use cluster resources effectively, the key challenge is having the ability to gain quick insights on how the execution time of any given application is likely to be impacted by the resources allocated to the application, e.g., the number of Spark executor cores and the size of the input data. Such insights can be used to quickly estimate the required resources needed for the desired execution time. Our paper proposes an automated execution time estimation approach called PERIDOT that involves executing a given application under a fixed resource setting with two small subsets of its input data to offer fast, lightweight execution time predictions. It analyzes these two executions to estimate the internal dependencies of the application and combines them with knowledge of Sparks data partitioning mechanisms to derive an analytic model that can estimate execution times for other resource settings and input data sizes. Our results from a wide range of applications and multiple Spark clusters show that PERIDOT can accurately estimate the execution time of an application from limited historical data, and suggest the minimum amount of resources required to meet an execution deadline.
... On the other hand, the monitoring and deployment adaptation support will be extended with the federated monitoring, and the machine learning-based approaches to switching between different deployment variants and detecting performance anomalies. Moreover, we are also developing the distributed control-theoretical planners that can support vertical resource elasticity for containerized application components that use both CPU and GPU resources [44]. The integration of such capabilities with the deployment refactorer will also be investigated. ...
Article
Full-text available
IoT-based applications need to be dynamically orchestrated on cloud-edge infrastructures for reasons such as performance, regulations, or cost. In this context, a crucial problem is facilitating the work of DevOps teams in deploying, monitoring, European Commission grant no. 825480 (H2020), SODALITE. and managing such applications by providing necessary tools and platforms. The SODALITE@RT open-source framework aims at addressing this scenario. In this paper, we present the main features of the SODALITE@RT: modeling of cloud-edge resources and applications using open standards and infrastructural code, and automated deployment, monitoring, and management of the applications in the target infrastructures based on such models. The capabilities of the SODALITE@RT are demonstrated through a relevant case study.
Chapter
A key focus of the SODALITE project is to assure the quality and performance of the deployments of applications over heterogeneous Cloud and HPC environments. It offers a set of tools to detect and correct errors, smells, and bugs in the deployment models and their provisioning workflows, and a framework to monitor and refactor deployment model instances at runtime. This paper presents objectives, designs, early results of the quality assurance framework and the refactoring framework.
Conference Paper
Full-text available
Large-scale data processing framework like Apache Spark is becoming more popular to process large amounts of data either in a local or a cloud deployed cluster. When an application is deployed in a Spark cluster, all the resources are allocated to it unless users manually set a limit on the available resources. In addition, it is not possible to impose any user-specific constraints and minimize the cost of running applications. In this paper, we present dSpark, a lightweight, pluggable resource allocation framework for Apache Spark. In dSpark, we have modelled the application completion time with respect to the number of executors and application input/iteration. This model is further used in our proposed resource allocation model where a deadlinebased, cost-efficient resource allocation scheme can be selected for any application. As opposed to the existing frameworks that focus more on modelling the number of VMs to use for an application, we have modelled both the application cost and completion time with respect to executors, hence providing a finegrained resource allocation scheme. In addition, users do not need to specify any application types in dSpark. We have evaluated our proposed framework through extensive experimentation, which shows significant performance benefits. The application completion time prediction model has a mean relative error (RE) less than 7% for different types of applications. Furthermore, we have shown that our proposed resource allocation model minimizes the cost of running applications and selects effective resource allocation schemes under varying user-specific deadlines.
Conference Paper
Following recent technological advances in software systems, such as microservices, containers, and cloud systems, DevOps has risen as a new development paradigm. Its aim is to bridge the gap between the development and management of software systems and to enable continuous development, deployment, and integration. To this end, automated tools and management systems play a crucial role. In this work, we propose a method to develop an autonomic management system (AMS) for multi-tier, multi-layer, data-intensive containerized applications based on a performance model of such systems. The model is shown to be robust and accurate in estimating and predicting the system's performance for various workloads and topologies, while the AMS is capable of regulating the application's behavior by taking independent actions on its various parts.
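As an illustration of how such an autonomic management system can be structured, the sketch below shows a minimal monitor-analyze-plan-execute loop in Python. The monitor, performance model, and actuation callables are hypothetical placeholders, not the interfaces of the system described above.

    # A minimal skeleton of a monitor-analyze-plan-execute loop for an
    # autonomic management system. The performance model and actuation
    # interface are illustrative assumptions, not the paper's.
    import time

    def autonomic_loop(monitor, model, actuate, slo_latency_s, period_s=10):
        while True:
            metrics = monitor()                  # Monitor: collect live metrics
            predicted = model(metrics)           # Analyze: predict latency
            if predicted > slo_latency_s:        # Plan: compare against the SLO
                actuate(scale_out=True)          # Execute: add capacity
            elif predicted < 0.5 * slo_latency_s:
                actuate(scale_out=False)         # Execute: reclaim capacity
            time.sleep(period_s)

    # Example wiring with trivial stubs (replace with real telemetry/actuators):
    # autonomic_loop(lambda: {"rps": 120}, lambda m: 0.8,
    #                lambda scale_out: None, slo_latency_s=1.0)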
Conference Paper
Modern Web applications exploit Cloud infrastructures to scale their resources and cope with sudden changes in the workload. While the state of practice is to focus on dynamically adding and removing virtual machines, we advocate that there are strong benefits in containerizing applications and in scaling the containers. In this paper we present an autoscaling technique that allows containerized applications to scale their resources both at the VM level and at the container level. Furthermore, applications can combine this infrastructural adaptation with platform-level adaptation. The autoscaling is made possible by our planner, which consists of a grey-box, discrete-time feedback controller. The work has been validated using two application benchmarks deployed to Amazon EC2. Our experiments show that our planner outperforms Amazon's AutoScaling by 78% on average without containers, and that the introduction of containers allows us to improve by a further 46% on average.
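The following Python sketch illustrates the general shape of a discrete-time feedback planner for container-level scaling. The PI structure, the gains, the set point, and the actuation interface are illustrative assumptions rather than the grey-box controller described above.

    # A minimal sketch of a discrete-time feedback planner that tracks a
    # response-time set point by adjusting a container's core allocation.
    # Gains and interfaces are illustrative assumptions.

    class ContainerScaler:
        def __init__(self, setpoint_s, kp=0.8, ki=0.3, min_cores=1, max_cores=32):
            self.setpoint = setpoint_s   # target response time in seconds
            self.kp, self.ki = kp, ki    # proportional and integral gains
            self.integral = 0.0
            self.min_cores, self.max_cores = min_cores, max_cores

        def step(self, measured_s, current_cores):
            # Positive error means the application is slower than the set
            # point, so more cores should be allocated.
            error = measured_s - self.setpoint
            self.integral += error
            target = current_cores + self.kp * error + self.ki * self.integral
            # Saturate to the actuator's admissible range.
            return max(self.min_cores, min(self.max_cores, round(target)))

    scaler = ContainerScaler(setpoint_s=2.0)
    cores = scaler.step(measured_s=3.5, current_cores=4)  # slower than target: scale up

Running the controller at each sampling period (e.g., every few seconds) lets fast container-level corrections absorb load spikes while slower VM-level scaling adjusts overall capacity.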
Chapter
Apache Spark is one of the best-known frameworks for executing big-data batch applications over a cluster of (virtual) machines. Defining the cluster (i.e., the number of machines and CPUs) so as to guarantee the application's execution time (deadline) is a trade-off between the cost of the infrastructure and the time needed to execute the application. Sizing the computational resources to prevent cost overruns can benefit from the use of formal models as a means to capture the execution time of applications. Our model of Spark applications, based on the CLTLoc logic, is defined by considering the directed acyclic graph around which Spark programs are organized, the number of available CPUs, the number of tasks elaborated by the application, and the average execution times of tasks. If the outcome of the analysis is positive, then the execution is feasible, that is, it can be completed within a given time span. The analysis tool has been implemented on top of the Zot formal verification tool. A preliminary evaluation shows that our model is sufficiently accurate: the formal analysis identifies execution times that are close (with an error of less than 10%) to those obtained by actually running the applications.
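As a rough intuition for the quantities such a model reasons about, the sketch below computes a simple "waves of tasks" estimate of a DAG's duration and checks it against a deadline. This back-of-the-envelope approximation is only illustrative; it is not the CLTLoc encoding verified by Zot.

    # A minimal sketch of the timing estimate underlying this kind of
    # analysis: each stage of the DAG runs ceil(tasks / cores) "waves" of
    # tasks, each lasting roughly the stage's average task time. This is an
    # illustrative approximation, not the paper's CLTLoc encoding.
    import math

    def estimated_duration(stages, cores):
        # stages: list of (num_tasks, avg_task_time_s) pairs, assumed here
        # to execute sequentially in DAG order.
        return sum(math.ceil(tasks / cores) * avg_time for tasks, avg_time in stages)

    def feasible(stages, cores, deadline_s):
        return estimated_duration(stages, cores) <= deadline_s

    # Example: two stages (400 tasks x 2 s, 100 tasks x 5 s) on 50 cores:
    # 8 waves x 2 s + 2 waves x 5 s = 26 s, so a 30 s deadline is feasible.
    print(feasible([(400, 2.0), (100, 5.0)], cores=50, deadline_s=30))  # True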
Article
Big Data applications make it possible to analyze large amounts of data, not necessarily structured, though at the same time they present new challenges. For example, predicting the performance of frameworks such as Hadoop and Spark can be a costly task, hence the need for models that can be a valuable support for designers and developers. Big Data systems are becoming a central force in society, and the use of models can also enable the development of intelligent systems that provide Quality of Service (QoS) guarantees to their users through runtime system reconfiguration. This paper contributes a novel modeling approach based on fluid Petri nets to predict the execution time of MapReduce and Spark applications, which is suitable for runtime performance prediction. The models have been validated through an extensive experimental campaign performed at CINECA, the Italian supercomputing center, and on the Microsoft Azure HDInsight data platform. Results show that, on average, the prediction error is around 9.5% of the actual measurements for MapReduce and about 10% for Spark.
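To convey the intuition behind fluid models, the following Python sketch treats a stage's pending tasks as a continuous fluid level drained by a bounded pool of servers. It illustrates the general fluid approximation only; it is not the paper's fluid Petri-net model, whose rates and structure come from the actual applications.

    # A minimal numerical sketch of the fluid approximation: discrete tasks
    # become a continuous fluid level, and throughput is bounded by the number
    # of busy servers. Illustrative only; not the paper's Petri-net model.

    def fluid_completion_time(tasks, servers, service_rate, dt=0.01):
        # Drain `tasks` units of fluid; the instantaneous throughput is
        # min(level, servers) * service_rate (tasks completed per second).
        level, t = float(tasks), 0.0
        while level > 0.5:  # stop when less than half a task remains
            throughput = min(level, servers) * service_rate
            level -= throughput * dt
            t += dt
        return t

    # Example: 1000 tasks, 20 servers, each completing 0.5 tasks per second;
    # the bulk drains at 10 tasks/s (~98 s), plus a short exponential tail.
    print(round(fluid_completion_time(1000, 20, 0.5), 1))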