Analysis and Performance Evaluation of
Deep Learning on Big Data
Kassiano J. Matteussi, Breno F. Zanchetta, Germano Bertoncello, Jobe D. D. Dos Santos
Julio C. S. Dos Anjos, Claudio F. R. Geyer
Institute of Informatics
Federal University of Rio Grande do Sul
Porto Alegre, Brazil 91509–900
Email: {kjmatteussi, bfzanchetta, gbertoncello, jobe.dylbas, jcsanjos, geyer}@inf.ufrgs.br
Abstract—Deep Learning (DL) and Big Data (BD) have converged to a hybrid computing paradigm that merges the dynamic processing in DL models with the computational power of the distributed processing of the BD frameworks. In this context, this work aims to conduct an analysis and performance evaluation of DL applications in BD. The experiments evaluate how the application training completion time can be related to the model's precision loss and the impacts of distributed computing on DL models. The experiments were performed in Microsoft Azure using the BigDL framework, which allows using both Spark and TensorFlow on top of a Yarn cluster. The outcomes revealed a speedup of up to 8x and accuracy higher than 95%.
Keywords—Deep Learning, Distributed Deep Learning, Big Data, BigDL, Apache Spark, Parallel Processing
I. INTRODUCTION
Current trends in Data Science revolve around two main fields: Big Data (BD) and Deep Learning (DL), a branch of Artificial Intelligence (AI). The first attracts many organizations that seek to obtain value from complex datasets and improve their business processes through data processing and analysis [1]–[4]. The second allows machines to build their own conceptual knowledge over the data with the use of ML algorithms; in more complex settings, it solves problems that are intuitive for humans but very difficult to describe in objective terms to a machine [5].
DL is used in many areas such as image processing (e.g., facial recognition) [6], [7], object detection [8], robotics [9], recurrent networks (e.g., autonomous translation) [10]–[12], and acoustic processing [13], [14]. The development of DL solutions at large scale is very complex and can lead to errors in tasks such as data acquisition and partitioning, allocation and management of resources, workload balancing, model deployment, fault tolerance, and so on [15].
In addition, recent works [16]–[23] investigate the relation between the BD stack and the benefits of DL. These studies present frameworks whose main goal is to accelerate application training time by exploring specific libraries such as MLlib, and evaluate the performance of BD applications (Sort, Terasort, Wordcount, and so on) whilst using DL features. Also, the union of several Artificial Intelligence (AI) solutions in a combined distributed execution framework may fail to meet dynamic response and real-time decision-making requirements, which demands attention towards the development of better framework architectures [15]. Therefore, we identify a lack of comprehensive studies that analyze the impact of distributed processing in DL scenarios, as well as of performance evaluations of structures, models, and datasets in deep networks using local and distributed scenarios.
In this context, this work aims to conduct an analysis and performance evaluation of DL applications in BD. The evaluation measures the impact of distributed computing on DL models in terms of application training completion time and model precision. The experiments represent a controlled real-world test scenario in the Microsoft Azure cloud. The BigDL framework [18] is used to allow both Spark and TensorFlow engines to run on top of a Yarn cluster.
The main contributions of this paper are: i) to provide a detailed performance analysis of DL applications distributed in a BD environment; ii) to evaluate how model precision is preserved under distributed training and optimization.
This paper is structured as follows. Section II introduces the main fields studied in this work. Section III presents the most closely related works. The case study and methodology are described in Section IV. The results are presented in Section V and, finally, the conclusion is given in Section VI.
II. BACKGROUND
Big Data (BD) refers to the increase in data volume (hard to process and store in traditional systems), data variety (the indistinct nature of the data), and velocity (which requires considerable computational effort), combined with the veracity needed to transform raw data into valuable and reliable information [24]. The literature presents relevant contributions for BD scenarios in several contexts (cloud, energy, programming models, software, algorithms, security, and so on) [19]. Meanwhile, Machine Learning (ML) introduced a series of real-world solutions with very complex autonomous operations.
However, ML still relied too much on human interaction to readjust the algorithms, especially in scenarios where the problem is intuitive for humans but very difficult to describe in objective terms to machines, thus contradicting the point of task automation. DL then emerged as a branch of ML that allows machines to create their own conceptual knowledge over the data and combine it into other, more complex, self-generated concepts [5].
Although BD enables fast data processing, it is important to consider emerging problems that are now integrated with ML and DL models, for instance complex data access patterns, fast information retrieval, data classification, and semantic indexing. Useful tools can be mentioned in different areas. First, for BD processing, there is a set of frameworks based on the MapReduce programming model (e.g., Spark, Hadoop Yarn, and Flink, which allow the use of the ML library MLlib). Second, focusing on DL, the main Neural Networks (NN) can be mentioned: the Feed-Forward Neural Network (FNN), composed of computational units based on the Multi-Layer Perceptron (MLP) united into an acyclic graph; the Convolutional Neural Network (CNN), which unites convolution operations with the FNN's feature extraction, thus improving image and video processing; the Recurrent Neural Network (RNN), which uses time sequences and internal states to process data sequences, providing great results in text and audio processing; and the Long Short-Term Memory (LSTM), a particular kind of RNN that introduces short-term and long-term signals to solve the RNN's long-dependency issues.
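As a minimal illustration of these families, the sketch below defines a feed-forward/MLP network, a small CNN, and an LSTM in PyTorch. The layer sizes and hyper-parameters are assumptions for exposition only, not the exact models evaluated later in this paper.

```python
# Illustrative PyTorch definitions of the network families described above;
# the layer sizes are assumptions, not the models evaluated in Section V.
import torch
import torch.nn as nn

class FNN(nn.Module):
    """Feed-forward network: MLP units arranged in an acyclic graph."""
    def __init__(self, in_features=784, hidden=256, classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, classes))

    def forward(self, x):
        return self.net(x.flatten(1))

class SmallCNN(nn.Module):
    """CNN: convolutional feature extraction followed by an FNN classifier."""
    def __init__(self, classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Linear(16 * 4 * 4, classes)  # for 28x28 inputs

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# An RNN/LSTM processes a sequence step by step, carrying an internal state:
lstm = nn.LSTM(input_size=100, hidden_size=128, batch_first=True)
outputs, (h_n, c_n) = lstm(torch.randn(8, 50, 100))  # (batch, seq_len, features)
```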
Also, when these NNs are implemented in DL frameworks such as TensorFlow, Theano, PyTorch, and Caffe, they can be applied towards BD feature learning [25]. The next section presents the recent state of the art on DL analysis over Big Data. The primary goal is to provide a general view of the works that intend to merge both paradigms whilst highlighting the open problems.
The DL framework is wrapped into the BD engines through specific classes, and the DL models are spread across multiple worker node instances. Each worker receives a copy of the model and a small batch of the dataset from the driver node. The workers then conduct fast and non-precise computations (training) that must be synchronized and sent back to the driver node, while the worker nodes stay idle waiting for the next job. After the master node finishes averaging the models and optimizing them, the cycle starts again with new batches.
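The cycle above can be sketched conceptually with plain PySpark and NumPy, as follows. This is an assumption-laden illustration of synchronous parameter averaging (a toy linear model with plain gradient steps), not BigDL's actual implementation.

```python
# Conceptual sketch of the synchronous data-parallel cycle described above,
# using plain PySpark and NumPy; this is NOT BigDL's implementation.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="param-averaging-sketch")

def local_step(weights, batch, lr=0.1):
    """One fast, imprecise local update on a worker (linear-regression SGD step)."""
    X, y = batch
    grad = X.T.dot(X.dot(weights) - y) / len(y)
    return weights - lr * grad

# Synthetic dataset, partitioned across workers as (X, y) mini-batches.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(4096, 10)), rng.normal(size=4096)
batches = [(X[i::8], y[i::8]) for i in range(8)]
rdd = sc.parallelize(batches, numSlices=8).cache()

weights = np.zeros(10)                       # model kept on the driver
for epoch in range(20):
    bcast = sc.broadcast(weights)            # ship a copy of the model to workers
    # Each worker trains on its batch; results are synchronized on the driver.
    local_models = rdd.map(lambda b: local_step(bcast.value, b)).collect()
    weights = np.mean(local_models, axis=0)  # driver averages the models
    bcast.unpersist()                        # next cycle starts with new parameters
```

A production framework would avoid collecting every replica on the driver, but the control flow mirrors the cycle described above.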
III. RELATED WORK
In the work DeepSpark [16], the authors propose a distributed and parallel DL framework that evaluates the use of Apache Spark in GPGPU-based commodity clusters. This approach distributes the workloads and adjusts some parameters automatically in order to set test scenarios that deploy Caffe models into the Spark framework. The authors evaluate the turnaround time per training, considering 1000 iterations, and present a speedup analysis. The results revealed that training time was reduced by up to 42% using 16 executors, with an accuracy of 0.6. Our work uses more epochs in the training of the models and hence achieves better accuracy results.
In SparkNet [17], Moritz et al. analyze a framework for training deep networks in Spark. The implementation uses a simple parallelization scheme for Stochastic Gradient Descent to handle high communication latency. The authors evaluated the speedup of Caffe running on the Spark framework on top of a GPGPU-based cluster, with the goal of presenting the time required to reach an accuracy of 40%. The work evaluated the precision loss in the DL model by analyzing standalone and distributed scenarios.
Intel's BigDL [18] implements a distributed DL framework for BD with mini-batch synchronous optimization and a centralized parameter server for model training. BigDL can incorporate DL tools such as TensorFlow, Caffe, Keras, and PyTorch. Regarding BD, BigDL is implemented coupled with Spark, and JVM support allows integration in local mode with Apache Storm, Apache Flink, and Apache Kafka. Also, a key aspect of BigDL is the quantization of 32-bit floating point parameters into 8-bit integers, which does not significantly affect model accuracy but reduces model size by up to four times and results in a twofold distributed training speedup.
Gupta et al. [19] present a framework that incorporates Apache Spark (using MLlib) and the advanced ML architecture of a deep MLP, plus the cascade learning concept. The authors evaluated two real-world datasets to measure the difference between unmodified Spark experiments and the proposed framework. The F1 score and accuracy metrics identified and analyzed were not very expressive, but indicated the feasibility of using CPUs for this type of processing. Our work also uses CPUs, but the main difference is the evaluation with varied datasets and models.
DLoBD [22] introduces a rich experimental methodology for DL models and datasets over BD in four case studies, while also evaluating how the combination of these architectures works. Even though it evaluates many models and frameworks, few combinations of frameworks converge on the same model; therefore, its inter-modular comparisons can be unreliable. Nevertheless, the work conducts a fine-grained analysis of each model and is the work most closely related to ours.
In summary, to the best of our knowledge, there is a lack of analysis of the impact of distributed processing in DL scenarios (for instance, what happens to the accuracy of the models when they are processed in a distributed fashion) and of performance evaluations of neural networks, models, and datasets using local and distributed scenarios.
IV. PROPOSED APPROACH
The development and optimization of models, applications, and frameworks in distributed environments make factors such as performance and scalability a challenge. Even though the state of the art presents studies that demonstrate performance evaluation in distributed DL, they do not show clearly what impact distributed processing may cause on DL applications. With these considerations in mind, this work conducts an analysis and performance evaluation of DL on BD. The proposed evaluation measures the impact of distributed computing on DL models in terms of application completion time and model precision loss.
A. BigDL Framework
The proposed evaluation uses the BigDL framework because it offers a data representation that is common to DL frameworks and is also compatible with BD. To support parallel and distributed processing, BigDL is deployed on top of the Apache Spark framework in an optimized way.
Even though BigDL performs data-parallel training with synchronous optimizers like TensorFlowOnSpark [20], it also outperforms that framework, which incorporates Spark dataset preparation only in the earlier stages of execution and bypasses the remainder of Spark's functions in favor of self-implemented libraries. BigDL [18] focuses on exploiting the full extent of Spark's functionality by using the orchestration layer to allocate cluster resources, and then letting driver nodes schedule and dispatch jobs to workers, which perform the computations and physically store the data.
BigDL differs from other synchronous distributed tools because it reduces the life span of Spark jobs and can grant more resilience to resource alterations (for example, errors and failures, resource sharing, and preemption) over time. It can also be combined with a tool named Drizzle for task scheduling in very large-scale environments, which groups large amounts of tasks per scheduling decision and therefore reduces overhead.
Finally, BigDL softens the negative effects on accuracy in distributed DL through Model Quantization for dimensionality reduction (e.g., moving 2D convolutions in NNs to a low-precision representation). Hence, 32-bit floating point parameters are turned into 8-bit integers, granting up to a fourfold model size reduction and a twofold inference speedup with a model accuracy drop of less than 0.1% [18].
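A minimal NumPy sketch of this kind of linear 8-bit quantization is shown below. The exact scaling and rounding scheme used by BigDL is not detailed here, so the code is only an assumed illustration of why parameter storage shrinks by roughly a factor of four.

```python
# Illustrative linear quantization of 32-bit float parameters to 8-bit integers.
# The exact scheme used by BigDL may differ; this only shows the 4x size effect.
import numpy as np

def quantize(weights_fp32):
    scale = np.abs(weights_fp32).max() / 127.0        # map the value range to int8
    q = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)      # e.g. a conv/linear kernel
q, scale = quantize(w)

print(w.nbytes / q.nbytes)                             # -> 4.0 (size reduction)
print(np.abs(w - dequantize(q, scale)).max())          # small quantization error
```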
Table I shows the combination of neural networks, models
and datasets into BigDL, which was chosen for our analysis.
TABLE I
THE BIGDL STACK

Network  Model                   Dataset
CNN      ResNet                  CIFAR-10
CNN      VGG                     CIFAR-10
CNN      LeNet                   MNIST
CNN      Text Classifier (CNN)   20NewsGroup/GloVe-6B
RNN      Text Classifier (LSTM)  20NewsGroup/GloVe-6B
RNN      Tree-LSTM               Stanford Sentiment Treebank
The datasets used in this work are described as follows:
CIFAR-10: a set of RGB images separated into 10 distinct classes. The set is sub-divided into 50,000 images for training and 10,000 for testing, where each image is a 32x32-pixel RGB matrix;
MNIST: composed of 70,000 images of 28x28 bilevel pixels with centralized hand-written digits. The dataset is classified into 10 distinct classes, with 60,000 images for training and 10,000 images for testing;
20NewsGroup: a collection of 20,000 documents from discussion groups, separated into 20 different topics;
Stanford Sentiment Treebank: film reviews collected from Rotten Tomatoes and published by [26]. It consists of about 12,000 sentences and 215,000 unique phrases.
B. Environment
1) Software Stack: The software stack used in this work is
described in Table II.
TABLE II
SOFTWARE STACK

Operating System                        Ubuntu 16.04.5 LTS
Apache Spark                            v2.2.0
Hadoop YARN                             v2.6.3
Hadoop Distributed File System (HDFS)
Java                                    OpenJDK 8u181
Scala                                   2.11.8
Python                                  2.7.15 (NumPy v1.15.4, Six v1.11.0)
BigDL                                   v0.7.1
2) Infrastructure: BigDL is CPU-optimized through the Math Kernel Library (MKL) for Intel Xeon processor-based clusters. For instance, the work in [27] revealed that BigDL delivers up to 3.83 times speedup on a Xeon cluster compared to a GPU cluster. For this reason, the environment used for the experiments is a Microsoft Azure CPU cluster, with the node settings shown in Table III.
TABLE III
NODE HARDWARE SPECIFICATIONS

Component   Specification
Processor   Intel Xeon E5-2673, 4 cores (2.4-3.1 GHz)
Memory      30 GB RAM
Storage     1 TB SSD
The driver and worker nodes belong to Microsoft's D12v2 hyper-threaded general-purpose instances. Details about the cluster configurations can be seen in Table IV.
TABLE IV
CLUSTER CONFIGURATION

Worker Nodes   Cores   Memory (GB)   Storage (TB)
4              16      120           4
8              32      240           8
12             48      360           12
3) Parameters Configuration: The main Spark parameters adjusted for the experiments performed on top of YARN are listed below (an example submission is sketched after the list):
Master: determines the Spark execution mode, which can be set to local or to the Yarn cluster;
Driver-memory: defines the memory allocated to the driver;
Num-executors: sets the number of executors;
Executor-cores: indicates the number of cores per executor;
Executor-memory: defines the memory allocated per executor;
Class: sets the application's entry point.
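For illustration, the sketch below assembles these parameters into a spark-submit invocation from Python. The memory sizes, executor counts, JAR path, and entry-point class are hypothetical placeholders rather than the exact values used in the experiments.

```python
# Hypothetical spark-submit invocation assembled in Python; the memory sizes,
# executor counts, JAR path and entry-point class are placeholders only.
import subprocess

cmd = [
    "spark-submit",
    "--master", "yarn",                 # run on the YARN cluster (vs. local mode)
    "--driver-memory", "20g",           # memory allocated to the driver
    "--num-executors", "4",             # number of executors
    "--executor-cores", "4",            # cores per executor
    "--executor-memory", "25g",         # memory per executor
    "--class", "com.example.dl.Train",  # application entry point (placeholder)
    "bigdl-app.jar",                    # application JAR (placeholder)
]
subprocess.run(cmd, check=True)         # assumes spark-submit is on the PATH
```

In practice, the executor counts and memory values would be tuned to match the cluster configurations of Table IV.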
C. Methodology
In this work, we evaluate performance through a speedup analysis and measure the impact on the accuracy of TensorFlow models when using the parallel and distributed processing paradigm provided by Spark in BigDL, in a short scalability scenario. The metrics evaluated are time (s) and accuracy loss (%). The model and dataset test scenarios are combinations of 5 models with 4 datasets into 6 distinct tests. Moreover, we evaluate the local (standalone) environment and 3 additional cluster settings, coming to a total of 6 test scenarios in 4 cluster configurations, thus resulting in 24 distinct execution sets.
Also, even though Jain's experimental methodology [28] defines the ideal number of repetitions as 30 executions for each configuration, we reduced the tests to 10 repetitions because our results presented a low standard deviation. Even with this threefold reduction, the experiments ensure a confidence interval of 95%, for a total of 240 experiments.
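The paper does not state how the interval is computed; a common choice for 10 repetitions is the Student-t confidence interval, sketched below under that assumption (the measurement values are hypothetical).

```python
# Sketch: 95% confidence interval for the mean of 10 repeated measurements,
# using the Student-t distribution (an assumed procedure, consistent with [28]).
import math
import statistics
from scipy import stats

times = [612.3, 598.7, 605.1, 610.9, 601.4,
         607.8, 599.2, 604.6, 611.0, 602.5]        # hypothetical run times (s)

n = len(times)
mean = statistics.mean(times)
sem = statistics.stdev(times) / math.sqrt(n)       # standard error of the mean
t95 = stats.t.ppf(0.975, df=n - 1)                 # two-sided 95% critical value

print(f"{mean:.1f} s +/- {t95 * sem:.1f} s (95% CI)")
```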
V. EVAL UATION
A. Performance Analysis
In the following figures, the y-axis represents the execution time measured in seconds and the x-axis shows the number of nodes in each experiment.
Fig. 1. Comparison of CNN models: ResNet and VGG on CIFAR-10 Dataset
Fig. 2. CNN, LeNet Model - MNIST Dataset
Fig. 3. CNN, Text Classifier Model - 20NewsGroup/GloVe-6B Dataset
Figure 1 shows the ResNet and VGG performance when trained on the CIFAR-10 dataset. The VGG model reduced its training time by 84.6% with 12 worker nodes compared to the baseline, while the ResNet model achieved a reduction of up to 86.1%.
Even though the performance gain is expressive for both models, ResNet outperforms VGG by 20% in the standalone scenario and by 28% with 12 workers. These values correlate with a BigDL behaviour that, in distributed training, tends to benefit deeper convolutional models over models with deeper feature extraction units.
In other words, the synchronous distributed training process in BigDL may be less suitable for models with too many neurons in the fully connected layers (VGG has twice as many as ResNet), due to the massive increase in the number of model parameters. ResNet is also penalized in distributed training, but, unlike VGG, this is due to the high computational cost of its many convolutional layers (an elevenfold increase compared to VGG). Still, both models may undergo convolutional and linear quantization for optimization purposes, and ResNet's residual blocks possibly leverage parallelism during multi-core training better than VGG's sequential convolutions, thus reducing training idle time.
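The trade-off described above, fully connected layers inflating the parameter count while convolutional layers inflate the computation, can be made concrete with a back-of-the-envelope count. The layer shapes below are generic assumptions, not the exact VGG or ResNet variants used in the experiments.

```python
# Back-of-the-envelope parameter and multiply-accumulate counts, illustrating
# why fully connected layers dominate parameters while convolutions dominate
# compute. Layer shapes are generic assumptions, not the exact VGG/ResNet used.
def conv_stats(c_in, c_out, k, h_out, w_out):
    params = c_out * (c_in * k * k + 1)          # weights + biases
    macs = params * h_out * w_out                # reused at every spatial position
    return params, macs

def fc_stats(n_in, n_out):
    params = n_out * (n_in + 1)
    macs = params                                # each weight used once per sample
    return params, macs

print(conv_stats(c_in=128, c_out=256, k=3, h_out=16, w_out=16))  # ~0.3M params, ~75M MACs
print(fc_stats(n_in=4096, n_out=4096))                           # ~16.8M params, ~16.8M MACs
```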
In Figure 2 we present the performance of the LeNet model on the MNIST dataset. In comparison with the standalone scenario, LeNet with 12 workers shows a 74.5% reduction in training time. It is possible to observe in Figure 2 that the reduction in training time of LeNet also behaves nearly as a decreasing exponential curve. Even though the execution time was not reduced as much as in other models, parallelism still ensured performance gains, as can be verified in Table V. LeNet was the model with the highest accuracy in all configurations and one of the smallest accuracy variations in the experiments.
Figure 3 presents the performance of the Text Classifier (CNN model) on the 20NewsGroup/GloVe-6B dataset. The CNN model reduces the training time by up to 86.5% with 12 worker nodes compared to the baseline.
In this case, it becomes clear how parallelism supports application performance with minimal penalties. The accuracy penalty from distributed optimization and batch distribution is a low price given the large reduction in execution time, which can be reinvested in more training steps to mitigate this trade-off. Also, the workload distribution possibly reduces resource usage (CPU, memory, and others), avoiding interference and ensuring high data throughput.
Figure 4 shows the performance of the Text Classifier (LSTM model) on the 20NewsGroup/GloVe-6B dataset. The processing time decreases following a nearly exponential curve, reaching an 87.4% reduction with 12 workers against the baseline.
Figure 5 shows the execution time of the Tree-LSTM model on the Stanford Sentiment Treebank dataset. The model obtained a reduction of 44.5% in training time with 12 worker nodes. In addition, we note that this model did not show significant performance gains among the distributed runs, as seen in Table V. We believe that this effect occurs because LSTM networks require more epochs than regular CNNs to achieve plausible accuracy; thus, fixing a common number of epochs for all experiments, to ensure fairness, produced this poor result for the LSTM.
Fig. 4. RNN, Text Classifier Model - 20NewsGroup/GloVe-6B Dataset
Fig. 5. RNN, Tree-LSTM Model - Stanford Sentiment Treebank Dataset
As a complement, Table V shows the obtained speedup and efficiency, providing further data about the same experiments.
TABLE V
SPEEDUP AND EFFICIENCY

            Speedup             Efficiency
Model       4     8     12      Local  4     8     12
ResNet      3.24  5.08  7.19    1      0.81  0.64  0.60
VGG         2.29  4.43  6.49    1      0.57  0.55  0.54
LeNet       1.43  2.77  3.93    1      0.36  0.35  0.33
TC-CNN      2.65  5.03  7.40    1      0.66  0.63  0.62
TC-LSTM     2.31  4.42  7.94    1      0.58  0.55  0.66
TreeLSTM    1.64  1.61  1.80    1      0.41  0.20  0.15
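The tabulated values are consistent with the usual definitions, speedup S_N = T_1 / T_N and efficiency E_N = S_N / N (an assumption on our part, which the table values reproduce). The short check below recomputes the ResNet efficiencies from its speedups.

```python
# Speedup and efficiency as assumed here: S_N = T_1 / T_N and E_N = S_N / N.
# The ResNet speedups below are taken from Table V; the computed efficiencies
# reproduce the tabulated values (0.81, 0.64, 0.60).
def efficiency(speedup, workers):
    return speedup / workers

resnet_speedup = {4: 3.24, 8: 5.08, 12: 7.19}     # from Table V
for n, s in resnet_speedup.items():
    print(n, round(efficiency(s, n), 2))          # -> 4 0.81, 8 0.64, 12 0.6
```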
The obtained speedups indicate the high performance provided by parallelism for DL models. The results also show that the workload is a threshold for performance gain, i.e., the performance could be further optimized if an ideal equation or model were designed taking into account the number of worker nodes, the computational resources, and the datasets. The ResNet model, for example, does not provide a proportional gain when the distribution goes from 8 to 12 worker nodes, indicating an unnecessary level of parallelism. Another example is the TC-LSTM model, which always benefits as parallelism grows, indicating an adequate approach.
B. Distributed Deep Learning: a Precision Analysis
Even though BigDL shows expressive feasibility for distributed DL computing, as seen previously, the properties of the DL models (e.g., accuracy) still need to be preserved. Thus, Table VI compares the precision when DL is distributed across up to 12 nodes. The Local label stands for the average model precision in the standalone setup. The node numbers 4, 8, and 12 represent the precision when training is distributed among that many nodes, and the Variation label is the measurement between the local and the 12-node configurations. BigDL's precision is available in its logs as an accuracy percentage reported during validation steps.
TABLE VI
PRECISION OF DL MODELS (%)

Model       Dataset                       Local  4      8      12     Variation
ResNet      CIFAR-10                      84.14  83.53  82.28  80.12  -4.77
VGG         CIFAR-10                      67.90  60.65  54.12  45.91  -32.38
LeNet       MNIST                         98.50  98.16  97.27  95.64  -2.9
TC (CNN)    20NewsGroup/GloVe-6B          85.64  85.63  84.50  84.22  -1.66
TC (LSTM)   20NewsGroup/GloVe-6B          29.46  31.32  24.20  18.31  -37.85
Tree-LSTM   Stanford Sentiment Treebank   46.17  46.56  47.07  45.45  -1.56
It is possible to observe that most models preserved their precision with low variation, which attests to BigDL's optimizations for distributed training. This behavior can be analyzed with the following observations:
The precision loss for the ResNet model was 0.72%, 1.49%, and 2.62% for 4, 8, and 12 worker nodes, respectively. In total, the mean precision loss was 4.77%;
The precision loss for the LeNet model was 0.34%, 1.23%, and 1.67% for 4, 8, and 12 worker nodes, respectively. The total precision loss was just 2.9%;
Tree-LSTM showed precision gains of 0.84% and 1.1% at 4 and 8 worker nodes, followed by a loss of 3.44% at 12 nodes. Overall, the mean precision loss on Tree-LSTM training was 1.56%, and the reasons for this behaviour are explained later;
The precision loss for the Text Classifier (CNN) model was 0.01%, 1.32%, and 0.33% for 4, 8, and 12 worker nodes, respectively. In total, the precision drop was only 1.66%.
The VGG model, on the other hand, revealed mean precision losses of 10.67%, 10.76%, and 15.16% for 4, 8, and 12 nodes, respectively, reaching an overall loss of 32.38% between the standalone and the 12-worker configurations. We believe the causes can be explained as follows: the increase in sequential convolutions and the high number of fully connected neurons generate an increase in training parameters which, even with quantization applied, leads to higher complexity in processing, storing, and transferring data. We also noticed that BigDL's implementation of VGG contains multiple Dropout instances between the convolutional layers, possibly to speed convergence up. This finding could explain why VGG's accuracy is so disappointing, whilst its high number of parameters contributes to an increase in training time. In conclusion, VGG needs more training epochs in order to achieve acceptable precision and is not efficient as-is.
Another model that lost accuracy is the Text Classifier with LSTM, which, according to Table VI, presented a gain of 6.31% followed by losses of 22.02% and 24.34% for 4, 8, and 12 nodes, respectively. Overall, the distributed processing caused an average precision loss of 37.85%. Even though the 4-node configuration introduced a 6.31% accuracy gain, the absolute values are so close that this phenomenon could be explained by better dataset shuffling and batching in the 4-node configuration, i.e., chance. On the other hand, the accuracy drops in the 8- and 12-node configurations happen because LSTMs rely on long and short dependencies for training. Thus, these networks suffer more penalties when receiving smaller batches, forming inconclusive dependencies, and need many more training epochs than CNNs to achieve acceptable precision rates. Tree-LSTM obtained nearly 45% accuracy in its tests, which reveals that this model also needs more epochs to achieve higher accuracy.
VI. CONCLUSION
In summary, this work conducted an analysis and performance evaluation of DL applications in BD. The evaluation was carried out on a Microsoft Azure cluster, where computation was performed by Apache Spark on top of Hadoop Yarn. Our evaluation measured the impact caused on TensorFlow DL applications when processed in a parallel and distributed fashion. The outcomes showed the feasibility of distributed processing, with a speedup of up to ~8x and an accuracy loss of less than 5% in the best case. In the future, we intend to replicate the performed experiments using the TensorFlowOnSpark and SparkNet frameworks. In addition, we intend to add more models and datasets, including ImageNet. Finally, we intend to extend this study to discover the ideal training distribution for each case study, that is, to determine the ideal configurations for each model.
VII. ACKNOWLEDGMENT
The authors thank the following Brazilian Agencies for supporting this work: FAPERGS Project "GREEN-CLOUD - Computação em Cloud com Computação Sustentável" (#16/2551-0000 488-9), "SmartSent" (#17/2551-0001 195-3), CAPES (Finance Code 001), CNPq and PROPESQ-UFRGS-Brasil.
REFERENCES
[1] K. J. Matteussi, M. G. Xavier, C. A. F. de Rose, and C. F. R.
Geyer, “Understanding and minimizing disk contention effects for data-
intensive processing in virtualized systems,” in Proc. 16th International
Conference on High Performance Computing and Simulation (HPCS),
2018, p. 901–908.
[2] J. C. S. dos Anjos, K. J. Matteussi, P. R. de Souza Jr, A. da Silva Veith,
G. Fedak, J. L. V. Barbosa, and C. F. R. Geyer, “Enabling Strategies for
Big Data Analytics in Hybrid Infrastructures,” ser. HPCS - International
Conference on High Performance Computing and Simulation. IEEE
Computer Society, July 2018, p. 869–876.
[3] P. R. R. de Souza, K. J. Matteussi, J. C. S. dos Anjos, J. D. D. dos
Santos, C. F. R. Geyer, and A. da Silva Veith, “Aten: A dispatcher for
big data applications in heterogeneous systems,” in Proc. International
Conference on High Performance Computing Simulation (HPCS), July
2018, pp. 585–592.
[4] S. M. Srinivasan, T. Truong-Huu, and M. Gurusamy, “Flexible band-
width allocation for big data transfer with deadline constraints,” in
Proc. IEEE Symposium on Computers and Communications (ISCC), July
2017, pp. 347–352.
[5] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning.
MIT Press, 2016.
[6] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing
the gap to human-level performance in face verification,” in Proc. IEEE
Conference on Computer Vision and Pattern Recognition, 2014, pp.
1701–1708.
[7] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: A convolutional neural-network approach,” IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proc. IEEE Conference on Computer Vision and Pattern Recognition,
2015, pp. 1–9.
[9] J. O. Gaya, L. T. Gonçalves, A. C. Duarte, B. Zanchetta, P. Drews, and S. S. Botelho, “Vision-based obstacle avoidance using deep learning,” in Proc. 13th Latin American Robotics Symposium, 4th Brazilian Robotics Symposium (LARS/SBR). IEEE, 2016, pp. 7–12.
[10] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
jointly learning to align and translate,” arXiv preprint arXiv:1409.0473,
2014.
[11] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” in Proc. 27th Advances in Neural Information
Processing Systems (NIPS), 2014, pp. 3104–3112.
[12] I. D. Falco, G. D. Pietro, G. Sannino, U. Scafuri, E. Tarantino, A. D.
Cioppa, and G. A. Trunfio, “Deep neural network hyper-parameter
setting for classification of obstructive sleep apnea episodes,” in Proc.
IEEE Symposium on Computers and Communications (ISCC), June
2018, pp. 01 187–01 192.
[13] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with
deep recurrent neural networks,” in Proc. IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–
6649.
[14] T. H. Vu and J.-C. Wang, “Acoustic scene and event recognition
using recurrent neural networks,” Proc. Workshop on Detection and
Classification of Acoustic Scenes and Events (DCASE), 2016.
[15] R. Nishihara, P. Moritz, S. Wang, A. Tumanov, W. Paul, J. Schleier-
Smith, R. Liaw, M. Niknami, M. I. Jordan, and I. Stoica, “Real-time
machine learning: The missing pieces,” in Proceedings of the 16th
Workshop on Hot Topics in Operating Systems. ACM, 2017, pp. 106–
110.
[16] H. Kim, J. Park, J. Jang, and S. Yoon, “Deepspark: A spark-based distributed deep learning framework for commodity clusters,” arXiv preprint arXiv:1602.08191, 2016.
[17] P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan, “Sparknet: Training
deep networks in spark,” arXiv preprint arXiv:1511.06051, 2015.
[18] Y. Wang, X. Qiu, D. Ding, Y. Zhang, Y. Wang, X. Jia, Y. Wan, Z. Li,
J. Wang, S. Huang et al., “Bigdl: A distributed deep learning framework
for big data,” arXiv preprint arXiv:1804.05839, 2018.
[19] A. Gupta, H. K. Thakur, R. Shrivastava, P. Kumar, and S. Nag, “A big
data analysis framework using apache spark and deep learning,” in Proc.
IEEE International Conference on Data Mining Workshops (ICDMW),
2017, pp. 9–16.
[20] L. Yang, J. Shi, B. Chern, and A. Feng, “Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters.” [Online]. Available: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep
[21] A. Feng, J. Shi, and M. Jain, “CaffeOnSpark Open Sourced for Distributed Deep Learning on Big Data Clusters,” accessed on: 01/28/2019. [Online]. Available: http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep
[22] X. Lu, H. Shi, R. Biswas, M. H. Javed, and D. K. Panda, “Dlobd:
A comprehensive study of deep learning over big data stacks on HPC
clusters,” IEEE Transactions on Multi-Scale Computing Systems, 2018.
[23] H. Y. Ahn, H. Kim, and W. You, “Performance study of distributed
big data analysis in yarn cluster,” in Proc. International Conference
on Information and Communication Technology Convergence (ICTC).
IEEE, 2018, pp. 1261–1266.
[24] I. A. T. Hashem, I. Yaqoob, N. B. Anuar, S. Mokhtar, A. Gani, and S. U.
Khan, “The rise of “big data” on cloud computing: Review and open
research issues,” Journal of Information Systems, vol. 47, pp. 98–115,
2015.
[25] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, “A survey on deep learning for
big data,” Journal of Information Fusion, vol. 42, pp. 146–157, 2018.
[26] B. Pang and L. Lee, “Seeing stars: Exploiting class relationships
for sentiment categorization with respect to rating scales,” in Proc.
43rd Annual Meeting on Association for Computational Linguistics.
Association for Computational Linguistics, 2005, pp. 115–124.
[27] J. Dai, X. Liu, and Z. Wang, “Building large-scale image feature
extraction with bigdl at jd. com,” Intel, Tech. Rep., Oct. 2017.
[28] R. Jain, The art of computer systems performance analysis: techniques
for experimental design, measurement, simulation, and modeling. John
Wiley & Sons, Inc., 1990.