Received September 14, 2020, accepted September 24, 2020, date of publication October 1, 2020, date of current version October 22, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3028127
Identifying Similarities of Big Data Projects—
A Use Case Driven Approach
MATTHIAS VOLK, (Graduate Student Member, IEEE), DANIEL STAEGEMANN,
IVAYLA TRIFONOVA, SASCHA BOSSE, AND KLAUS TUROWSKI
Faculty of Computer Science, Otto von Guericke University Magdeburg, 39106 Magdeburg, Germany
Corresponding author: Matthias Volk (matthias.volk@ovgu.de)
ABSTRACT Big data is considered one of the most promising technological advancements of the last decades. Today it is used for a multitude of data-intensive projects in various domains and also serves as the technical foundation for other recent trends in the computer science domain. However, the complexity of its implementation and utilization renders its adoption a sophisticated endeavor. For this reason, it is not surprising that potential users are often overwhelmed and tend to rely on existing guidelines and best practices to successfully realize and monitor their projects. A valuable source of knowledge is use case descriptions, of which a multitude exists, each of them with a varying information density. In this design science research endeavor, 43 use cases are identified by conducting a thorough literature review in combination with the application and adaptation of a corresponding template for big data projects. Through a subsequent categorization, which is performed by identifying and employing a hierarchical clustering algorithm, nine different standard use cases emerge as the contribution's artifact. These provide decision-makers with an initial entry point, which can be utilized to shape their project ideas, not only by assessing the general meaningfulness of their potential big data project but also in terms of concrete implementation details.
INDEX TERMS Big data, use case analysis, clustering, categorization, literature review, design science
research.
I. INTRODUCTION
Due to the ever-growing amount of data produced and cap-
tured by humanity [1], [2], the ability to analyze and subse-
quently use the contained information has gained a widely
acknowledged significance in today’s society [3]. While the
usage of data, in general, is not a new concept, the prevailing ‘‘data deluge’’ poses new challenges that overstrain traditional technologies and demand new solutions [4], leading to the term ‘‘big data’’. Even though there is no unified
definition of the term, the approach by the National Institute of Standards and Technology (NIST) is among the most common ones. It states that ‘‘Big Data consists of
extensive datasets – primarily in the characteristics of vol-
ume, velocity, variety, and/or variability – that require a
scalable architecture for efficient storage, manipulation, and
analysis’’ [5]. These data characteristics describe the nature
of the data to be processed and, thus, are often recognized
as crucial influence factors when it comes to the planning
and realization of big data projects. The volume, for instance,
describes the amount of data to be stored, managed, and
processed [5]. In turn, the variety of the data usually represents the heterogeneity of the structure [6], [7], but sometimes also the origin [5]. The same differentiated view exists for
the velocity that either addresses the speed with which the
data is incoming or the time for its processing [5], [7].
The variability refers to changes of the dataset, for instance,
in terms of the structure, rate of the data flow or the size
of the data [5]. Together, those characteristics are covered under the umbrella term of the four Vs of big data, with each V standing for the initial letter of one characteristic. Besides those
known core characteristics, many others emerged in recent
years, such as the value of the data or the veracity [5], [8].
To handle the additional challenges posed by those Vs,
compared to traditional workloads, numerous new strategies,
tools, and systems have been introduced in recent years.
Nowadays, those technologies and concepts are applied to
a wide array of domains, comprising, but not being limited
to, mobility [9], smart cities [10], media distribution [11],
healthcare [12], sports [13], education [14] and business [15].
Furthermore, the positive effects of the utilization of big
data technologies have been shown [16], [17], emphasizing
the importance of such efforts.
However, implementing and operating those systems is a
sophisticated endeavor with many possible pitfalls [18], [19],
which could diminish the benefits or even result in nega-
tive outcomes [20]. By considering the shortage of qualified
experts in this domain and the concurrent demand [21], [22], independent of the actual size of the enterprise, it appears
to be reasonable to support concerned decision-makers and
technicians. This applies especially to fundamental tasks
like the design of the underlying technical architecture [23].
Hence, a thorough description of corresponding use cases
could facilitate the realization of those kinds of complex
projects by providing a suitable source of information. However, the biggest deficiency of most of these descriptions is their level of detail. Although numerous contributions describing the general endeavor are published every year, they often omit information regarding the actual project, the problems that occurred, and the specific implementation. Furthermore, in some cases, similar methods and paradigms are applied, or even the same technological considerations are made, so that the cases sometimes differ solely in their main scope. This reduces
transparency for those who want to utilize such cases to
obtain information for similar application scenarios, in terms
of the general meaningfulness, best practices, concrete tech-
nologies, or even specific architecture details. Consequently,
today a multitude of case studies exists, potentially offering
valuable guidance for the realization of big data projects,
but their exploitation still remains a challenge. To uncover
inherent relations between those and thus overcome the out-
lined barriers, the following research question (RQ) shall be
answered in the course of this work:
Which standard big data use cases are revealed when applying cluster analysis to the corresponding case studies found in the literature?
Answering this question and structuring a collection of successful big data projects accordingly could constitute
a valuable resource for the instantiation of future projects
providing a general orientation and guidance but also con-
crete implementation details for specific scenarios. Due to the
presumed similarity of the cases in parts, the formulation of
standard use cases for the structuring of a potential collec-
tion appears to be a suitable solution. As a result, decision-
makers will obtain the opportunity to quickly identify cases,
which are similar to the challenges they are facing and thus,
lean on the existing knowledge base. It also enables them
to classify their projects and identify the needed expertise
in a more systematic and therefore more meaningful way.
Furthermore, those categories could be used as a foundation
for a big data decision support system. This allows for a
comparison of the expected usage scenario with familiar
exemplary cases and the incorporation of gained experiences
in a technology-supported way [24].
Since the first introduction of the most famous big data technology, Hadoop, in 2005 [25], its maturity has tremendously increased. To take this development into account, only use cases published between 2015 and 2018 were incorporated, as cases from the period before have
already been sufficiently investigated by other authors, such
as in [26]. Due to the start of this research in 2019 and the
decision to cover only literature of completed years, the end
date 2018 was chosen. Furthermore, to reduce the complexity,
misinterpretations, and effort for further modifications and
extensions, additional document analysis techniques were
applied. While the general approach resembles the work of
Ylijoki and Porras [26], using case study analysis and cluster-
ing, the specific methodology, the analyzed time frame, and
the respective objectives differ. Apart from that, the publi-
cation at hand especially addresses researchers and scientists
concerned with the project feasibility, technology selection as
well as implementation and application details in a big data
context.
A. METHODOLOGY
To answer the initially formulated RQ, multiple methodologies need to be applied in combination.
As a general foundation for the realization of this endeavor,
the design science research methodology according to
Hevner et al. [27] was used, providing an artifact as a solu-
tion to the formulated problem. In particular, similarities of
successful big data projects are investigated and standard
use cases are derived as the main artifact of this work.
To further improve the reproducibility and clarity of the
conducted measures, the six-step procedure according to Peffers et al. [28] was followed to ensure that the development of the intended solution is systematically approached.
The first step of this workflow focuses on the brief motivation
and description of the problem. Subsequently, the main objec-
tives are highlighted, which is directly followed by design
and development. As a transition, the theoretical foundation
needs to be investigated and relevant material collected. Since
the main artifact of the research, in the form of standard use
cases, builds upon existent use cases, a structured literature
review [29], [30], as well as a use case analysis were con-
ducted, examining the content of each of the cases in-depth.
Companies often write case studies for advertising purposes, for example, to win new customers or to present themselves in social media. On the other hand, they prepare the case studies as a documentation of their ‘‘best practices’’ [31]. Those describe the decisions made by companies, the reasons for those decisions, their implementation, and the following results [32]. Case studies can also be used as a guideline for subsequent users who have a particular problem or strive for a concrete solution. Either way, to provide maximum value, case studies must be written according to pre-defined standards [33]. Hence, it was presumed that those case studies describe the usage of big data technologies and the related processes in their context. To ensure this, the comprehensiveness of each of them was additionally checked by using a modified version of an existing use case template [34].
At this stage, important features for a later clustering
approach, implemented during the subsequent step, are iden-
tified. The following three steps concern the demonstration,
evaluation, and presentation of the artifact.
B. STRUCTURE
Based on the used methodology, the structure is as follows.
Within the first section, an initial overview of the current situation, as well as the derived research question and main objectives, is presented. Along with this, the used methodology as
well as the structure is introduced. In the subsequent second
section, the conducted structured literature review, resulting
in the collection of the regarded use cases, is described. The
third section thoroughly describes the performed clustering,
each of the clusters and their particularities, as well as the
actual development of the standard use cases. An evaluation
of the obtained results is presented in the fourth section.
Here, the standard use cases are tested on the basis of previously unseen data, for which a categorization is pursued. Within the fifth section, concluding remarks are
given. Apart from a summary, this comprises a discussion and
an outlook on future research.
II. THE STRUCTURED LITERATURE REVIEW
For the identification of relevant big data use cases that have
been published between 2015 and 2018, a structured literature
review is performed. In particular, the methodologies accord-
ing to Levy and Ellis [30] as well as Webster and Watson [29]
were used. To verify the comprehensiveness of the identified contributions, an existing use case template provided by the NIST was additionally adopted [34]. Within the
following section, the review protocol, the used template as
well as the results are meticulously described [35].
A. REVIEW PROTOCOL
To obtain a broad overview of the entire domain, the focus
of the search was not set on a single database. Instead, to ‘‘exhaust all sources that contain IS research publications’’ [30, p. 183], the scientific literature database Scopus was used for the initial keyword search. Although it was noted by different authors [29], [30] that multiple sources should be queried to receive an extensive overview of all relevant articles, only the mentioned database was used due to its comprehensiveness, since most of the widely accepted literature databases and their relevant articles are indexed there with a reference to the original source. For this reason, Scopus serves more as a kind of meta-database indexing relevant contributions, and it was not required to perform alterations on the queried terms and operators for each individual database. According to the targeted domain of interest, the terms ‘‘case study’’, ‘‘use case’’ and ‘‘case description’’ were used in combination with ‘‘big data’’. Further, to reduce the number of
irrelevant search results, additional inclusion and exclusion
criteria were formulated. Some widely accepted ones are, for
instance, the used language, the publication in a conference,
journal or book, and relevance for answering the formu-
lated research question. Only if all of the aforementioned inclusion criteria were met was a paper accepted. However,
sometimes it was noted that some of the contributions did not encompass as much information as needed. Due to this, various criteria were formulated; for example, if a use case was not presented well and did not suitably contain the required key information, the paper was rejected. In turn, a few of the identified use cases had a very high information density but focused only on the introduction, development, or evaluation of new technologies. An additional exclusion criterion was formulated for those cases. The complete collection of all
of the used inclusion and exclusion criteria is summarized
in Table 1. The initial material collection was performed
by applying the described keywords and some of the men-
tioned inclusion criteria directly through the advanced search
mechanics of the literature database. As a result, 2,379 non-
redundant publications were found. Following that, it was required to check the papers for their actual usability. The
refinement of the material was performed in a two-stepped
procedure. In the first step, the title, abstract, and structure
were checked. This resulted in 108 relevant contri-
butions. Within the second step, the actual content of the
remaining case studies was investigated. It was noticed that in most of the cases the content differed strongly in terms of the information density.
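For illustration, the described keyword search can be expressed in Scopus' advanced-search syntax along the following lines; since the exact query string is not reproduced here, the field codes and the year restriction should be read as an assumed reconstruction rather than the literal query:

    TITLE-ABS-KEY ( "big data" AND ( "case study" OR "use case" OR "case description" ) )
        AND PUBYEAR > 2014 AND PUBYEAR < 2019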
TABLE 1. Inclusion and exclusion criteria of the structured literature
review.
Most of the use cases were only roughly presented, neglecting important descriptive information, such as details about the data or the situation before the project. To simplify the mapping of the needed information, the qualitative analysis, and the evaluation of the comprehensiveness for each use case, the use of a corresponding template was deemed appropriate. As a basis, the very extensive template provided
by the NIST, covering eight different parts and 57 big data
project-related questions, was modified and used [34]. The
original template was designed by the NIST Big Data Public
Working Group (NBD-PWG) to collect existing use cases.
Due to its overarching purpose, the template as well as the
respective categories were continuously validated regarding
their applicability and, thus, compared to the actual content of the identified contributions.
B. USE CASE TEMPLATE
The basic template comprises a multitude of different aspects.
Apart from the general project description and the situation
before, the relevant big data characteristics, applied tech-
niques, and multiple other pieces of information are requested. Since not all of the template's fields were of major interest, modifications to the original version were
performed. After an initial scan, during the second step of the
refinement, fields which were not related to the formulated
criteria (cf. Table 1) and not required for the general appli-
cability of big data-related projects were removed (R). This
includes, for instance, the last two parts and questions like
‘‘do you foresee any potential risks from public or private open data projects?’’ or ‘‘under what conditions do you give people access to your data?’’ [34]. At the same time,
additional points, like the veracity of the data or privacy-
related information, were newly added (N). All other fields,
meaningful for the identified contributions, were adapted.
An overview of the general content of each template category
is depicted in Table 2, whereas a complete depiction of all
of the made considerations is shown within the appendix
in Table 8. While the first column describes the targeted content of the field, the latter focuses on the performed
changes. Either the field of the original use case template was
adapted (A), removed (R) or a new one was added (N).
TABLE 2. Overview of the use case template categories and their general
description (cf. [34]).
In total, 23 of the 57 original questions were removed, 27 adopted without any changes, and 7 modified, while 11 questions were newly added, resulting in a final template with 45 questions. For
instance, within the first part, only minor changes were made,
to prevent misleading interpretations during the investigation
of the use cases. This includes the renaming of the parts,
the deletion of one question, and the addition of the part
advantages of harnessing big data, which was covered by
almost all cases except for one [36]. In the subsequent second
part, additional characteristics were introduced or moved from the adjacent third part, due to their high coverage in the use case descriptions. In the following, every part was
evaluated regarding the usefulness for the intended compre-
hensiveness check. Consequently, smaller changes were conducted, as depicted in Table 8. Noteworthy are the last two parts, which were removed completely, since almost none of the use cases held information related
to those. While part seven deals with various workflow steps
and respective changes in the data characteristics, the last one goes into more detail when it comes to privacy and
security concerns. Although some papers, such as [33], [37],
[38], discussed various issues in detail, the relatively small
number of related cases made those parts not universally
applicable. Eventually, the final template was used to check the comprehensiveness of each identified use case.
C. RESULTS OF THE LITERATURE REVIEW
After the two-stepped refinement procedure and the compre-
hensiveness check through the used template were finished,
40 different case studies from a keyword-based search in the
academic area remained. To broaden the overview towards the practitioner's perspective, the same criteria, keywords, and competency questions were used for a search procedure on industrial case studies. After the keyword search proce-
dure, querying the Google Search Engine, 208 additional
cases were identified.
Through the subsequent application of the mentioned criteria and the use of the modified template, only three cases
remained. Those are from Lufthansa [39], Dell [33] and
Fujitsu [40]. One of the prevailing factors was the lack of
specific information. Instead, mostly advertisements were
presented to showcase the company’s competence. One pos-
sible reason for this might be the missing motivation to
present critical information, fearing the loss of competitive
advantages. An overview of all of the described steps is
depicted in Figure 1. In total, the material collection resulted
in 43 different cases, which were further used in the course
of this work. Although it was expected that an additional step in the form of a forward-backward search [29] would increase the number of promising cases, no new cases were identified.
Most of the cases were published in 2016 (seventeen case
studies), followed by 2017 (twelve case studies), 2018 (nine
case studies) and finally 2015 (five case studies). A general
investigation of the respective application domains reveals that the chosen case studies come from various areas.
FIGURE 1. Conducted steps of the literature review, including the initial keyword search and the subsequent refinement steps.

Almost 25 percent of all contributions originate from the healthcare area. Another important area, with 38 percent, is the internet of things (IoT), aiming to realize concepts
like smart city, smart transportation, smart buildings, and
more. Considering the various application areas, the time
scope and the different databases from which the case studies
originate, it can be concluded that those are a representative sample of the existing projects regarding the usage of big data technologies. A complete listing of all of them is
depicted in Table 3. While most of the related databases are directly addressed, the category others comprises the databases Taylor and Francis, Gesellschaft fuer Informatik, ACM, IADIS Portal, and Scitepress, as well as the use cases originating from company resources. Furthermore, a distinct number is assigned to each paper that will be used in the further context of this work.
TABLE 3. The Results of the Literature Review Mapped to the Respective
Literature Databases.
III. USE CASE ANALYSIS
Previously, a quantitative overview was given and a mapping of the selected case studies was performed. In the following, those will be analyzed in detail. Manual approaches often result in great effort if the analysis has to be repeated or extended in possible future work. Because of the high number of identified results, the subjectivity that may come with this kind of investigation of extensive documents, and to increase the comprehensibility of this research, a more objective analytical approach was chosen. In particular, document clustering was selected, not only to find relevant information but also, as described before, to identify standard use cases. Those shall facilitate decision support for practitioners and researchers willing to perform a big data-related project. The
creation of the intended solution, performed here, is equiva-
lent to the general design and development of the implicitly
followed design science research methodology.
A. SELECTION OF THE CLUSTERING ALGORITHM
To reduce the effort of analyzing the case studies, differ-
ent methods that are typically used for document cluster-
ing have been compared with each other in terms of their
applicability, as they are intensively investigated in various
contributions [78], [79]. In particular, partitioned, density-
based, and hierarchical types were investigated. While the first intends to assign similar objects to the same cluster while keeping the distance between the different clusters as large as possible [80], density-based approaches follow the idea of identifying regions of high density (clusters) and separating them from regions of lower density (noise) [81], [82]. The last approach, on the other hand, builds a tree structure in which each node, except for the leaf nodes, is a cluster that contains its children as sub-clusters (dendrogram) [78], [79], [82], [83]. Although all of
the mentioned algorithms come with multiple potentials and
benefits, not all of them are applicable to the current problem.
Partitioned clustering, such as the K-means algorithm, for
instance, allows the use of multiple calculation methods for
cluster building [78]. However, some authors, such as in [79], critically assessed this kind of algorithm, stating issues such as the choice of the initial centroids and the strictly predefined number of clusters, and further highlight the importance of other approaches [83].
A density-based approach, such as DBSCAN, tends to
work best with noisy data and outliers, but it is not
robust against high-dimensional data [83]. In the problem at hand, however, high-dimensional data must be processed and no outliers are present. The latter results predominantly from the
meticulous qualitative assessment of the contributions during
the literature review. Eventually, the hierarchical clustering
algorithm was chosen as a suitable alternative, avoiding the need for starting parameters that specify a strict number or size of the clusters [81]. Furthermore, high-dimensional data can be handled. Typically, this approach structures
a given dataset and ‘‘provide[s] a view of the data at
different levels of abstraction’’ [79]. One of the most
frequently used approaches is agglomerative clustering, which initially assigns each object to its own cluster and merges clusters until a whole tree is formed. The first step requires the calculation of a proximity matrix between the objects. Following that, the two closest clusters, i.e., those with the lowest distance, are merged and the proximity matrix is updated for the new cluster. The procedure is repeated until only one cluster remains [79], [81], [83]. Compared to the K-means and the DBSCAN algorithm, hierarchical clustering does not divide the points into final clusters; instead, a hierarchical structure based on the underlying distances is provided. In this regard, different results can be obtained through the alteration of
the used distance metrics and linkage functions [81]. Since all elements are connected to each other, outliers cannot be efficiently handled, but this was not required for the given task. However, at the same time, qualitative
assessments can be realized through the manual definition of
a desired distance. In summary, the basic steps that are needed
for the clustering are the definition of the feature set for all use
cases, the creation of the input matrix, the examination of the
hierarchical clustering, the definition of the cluster structure
as well as the determination of the intercluster distance. After
everything is defined, the clusters need to be reviewed and
modified in case they are not correctly assigned. By the end,
those will be defined as standard use cases and thoroughly
described. To summarize the aforementioned steps, an overview is depicted in Figure 2.
FIGURE 2. The use case analysis as a BPMN model.
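To illustrate the practical differences between the three algorithm families, the following minimal Python sketch contrasts their required input parameters on a stand-in binary feature matrix; it is purely illustrative, as the actual analysis was performed in Matlab (cf. the subsequent section):

    import numpy as np
    from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(43, 30)).astype(float)  # stand-in for a binary feature matrix

    # Partitioned: the number of clusters k must be fixed in advance.
    kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X)

    # Density-based: density parameters are needed; high-dimensional,
    # outlier-free binary data is a poor match for this family.
    dbscan = DBSCAN(eps=2.0, min_samples=3).fit(X)

    # Hierarchical (agglomerative): no fixed k is needed; the tree is
    # instead cut at a chosen distance threshold.
    hierarchical = AgglomerativeClustering(
        n_clusters=None, distance_threshold=4.0, linkage='ward').fit(X)

    print(kmeans.labels_, dbscan.labels_, hierarchical.labels_)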
B. CLUSTER ANALYSIS
As previously described, the standard use cases shall be deduced from the resulting clusters of the hierarchical clustering, especially from a higher level of abstraction. In the
beginning, an input matrix needs to be defined. Initially,
the cases were collected and qualitatively checked by using a
modified version of the NIST template, describing the current
situation (e.g. represented by the aim and data characteristics)
as well as the obtained solution (used methods and tech-
nologies). Although the template formed a promising starting
point for the description of the feature matrix, it was not
possible to use it as a direct input for the clustering algorithm.
This is not only due to some unnecessary descriptive fields, such as title, author, or the rough description of the use case, but also due to needed information that can be expressed in manifold ways, like the variety of the data or the used algorithms. After an additional examination of the filled templates, a total of 30 binary features was identified. For the construction of the input matrix, it was required to check each use case on the basis of the formulated feature set. Since the number of use cases did not differ too much from the number of attributes, this task was performed manually. An overview of
all features together with the respective numbers of the index
and the occurrences is depicted in Table 4.
As one may note, some of the listed features were identified more frequently than others. In descending order, these include: Dynamic Data, Data Fusion,
Unstructured Data, Heterogeneous Data, Statistical Calcula-
tions, Multiple Sources, Big Data Analysis, Real-time Data,
Hadoop, and Batch Processing. The complete mapping of
all features and the respective use cases is given in Table 5.
While one column represents one feature, each row stands
for one use case. All of the features are related to the relevant
data characteristics, used methods, fields of application, and
also applied technologies. In terms of the data characteristics,
for instance, it was needed to clarify whether the data is
coming from various types of sources and if it should be
shared between different users or applications during the
operation phase. Following that, the type of file system was
checked, in particular, if a distributed file system (e.g. HDFS),
wide-area file system (e.g. Lustre) or parallel file system
was used. If a particular use case (row) fulfilled one of the
formulated features (column), a filled dot () was noted,
whereas for not related features, an empty dot () was used.
Since most of the required functionalities of the intended
algorithm are already included within Matlab, this compu-
tational framework was used for the cluster analysis. The
created table was transformed into a binary matrix, filled
with ones and zeros, transposed, and eventually used as the input. Besides the actual data, only a few inputs were
required in Matlab. This includes the used distance mea-
sure and linkage function [84]. While the first computes
the distance between each pair of observations, the second
uses the calculated distance between the observations and
links them according to the type of function that is chosen [81].

TABLE 4. A list of all features and their occurrences.

In particular, the Euclidean distance and Ward's linkage function were used, resulting in the dendrogram depicted in Figure 3.
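Although the analysis was performed in Matlab, the same pipeline can be sketched in Python with SciPy; in this minimal sketch, the input is assumed to be the 43 x 30 binary matrix derived from Table 5, loaded from a hypothetical file:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # 43 use cases x 30 binary features (1 = filled dot in Table 5);
    # the file name is hypothetical.
    X = np.loadtxt('use_case_features.csv', delimiter=',')

    # Ward's linkage over Euclidean distances, as described above.
    Z = linkage(X, method='ward', metric='euclidean')

    dendrogram(Z, labels=np.arange(1, len(X) + 1))  # x-axis: use case numbers
    plt.ylabel('inter-cluster distance')
    plt.show()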
On the x-axis, the dendrogram shows each of the previously listed use
cases. The y-axis describes the distance between the various
use cases and aggregated clusters. While most of the existing
clustering methods require a strict number of clusters or the
elements contained in them, the used approach requires only
minimum and maximum values for both, ranging from two
to n (as the number of cases). During the investigation and
formulation of the features, huge differences between the
cases were noticed in parts, which directly influenced this
range selection. By having too many clusters that differ only
slightly from each other, a decrease in the usability for later
standard cases can be expected. This would diminish the
general idea of standard use cases, especially when it comes
to the classification of a planned project for a potential user.
Apart from the time-consuming planning steps needed beforehand, detailed knowledge about specific features would also be required to make further distinctions. This, in turn, would undermine the general sensibility and applicability of the targeted outcome. For that reason, the inter-cluster distance
together with the number of agglomerated clusters was exam-
ined to understand at which point all cases were assigned.
As one can note in the depicted diagram, at a distance of
two, only six clusters consisting of multiple cases are built, whereas 31 cases each remain as a separate cluster. At the
level of 3.5, only one case remained unassigned and in total
13 clusters were formed. A distance of 4 resulted in seven
distinct clusters, which comprise all of the 43 use cases. At a
level of 4.5, only five agglomerated clusters exist. By having
the aforementioned disadvantage of too few cases in mind,
the achieved seven clusters at a distance of 4 were chosen.
A further qualitative assessment and cross-checking of each
of these seven clusters, however, revealed the disadvantage of the non-weighted Euclidean distance function.
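On the basis of the linkage matrix Z from the sketch above, the examination of the different cut levels can be reproduced by counting the clusters that emerge at each distance:

    from scipy.cluster.hierarchy import fcluster

    # Number of clusters when cutting the dendrogram at the examined levels.
    for t in (2.0, 3.5, 4.0, 4.5):
        labels = fcluster(Z, t=t, criterion='distance')
        print(t, labels.max())

    # The chosen cut: seven clusters at a distance of 4.
    clusters = fcluster(Z, t=4.0, criterion='distance')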
C. EXAMINATION OF THE BUILT CLUSTERS
Since only binary decisions on the feature set are recognized, no in-depth information extraction and connections were made yet. As one can note, some of the use
cases revealed a rather high inter-cluster distance. For those
reasons, and to obtain a better understanding of the similari-
ties of those cases, further examinations were required. First
and foremost this includes the overall aim of the use case and
the interplay of the features. In the following, each of those
and their specifics are described in detail. Table 6 provides an
overview of the automatically built clusters and their assigned
use cases at a distance of 4.
1) DESCRIPTION OF CLUSTER 1
The first automatically built cluster comprises the seven use cases no. 2 [41], 3 [42], 5 [38], 6 [71], 8 [48], 37 [61] and 39 [63]. One particularity of the cases located in this cluster that was noticed at the very beginning of the examination and comparison was the aim to improve already existing analysis processes. On one side, this comprises the automation of cur-
rently manually performed actions, like the financial rumor
detection [71], the monitoring of patients at their homes,
instead of using the hospital capacities [63], or the analysis of traffic sensor data to recognize congestions and traffic
patterns [48]. On the other side, the improvement of existing
processes can be defined as achieving better quality of the
performed analysis. This can be realized by using all of the
available information, like unstructured medical and genome
data on the way to personalized medicine [63], investigating
the meaning of social media data to improve crisis mapping
systems [41] or the broadening of the scope [61].
All of those cases make use of unstructured data, coming
from various sources in different formats. Especially if those
have to be realized in real-time, sophisticated approaches
are required. This is, for example, the case for the automated financial rumor detection, considering more than 300 trades per day [71], or the simulation of wind turbine
configurations resulting in a data stream of 100 MB per hour [61].

TABLE 5. Resulting matrix highlighting the occurrence of a feature (F) in a specific use case (U).

FIGURE 3. Dendrogram of the cluster analysis, showing the cluster distance (y-axis) and use case number (x-axis).

TABLE 6. Automatically built clusters of the hierarchical clustering.

Furthermore, dynamic and permanent (historic) data are used in all of the cases. The same holds true for the application of statistical methods, like market activity statistics [71] or the probability of a medical condition
occurring in different situations [42], [63]. Moreover, the searching, querying, and indexing of the data should be enabled in the cases aiming to improve existing processes. The searching of the available data is an important part of the analysis. To make the analysis of big data easier, the data can be classified into different categories.
A commonality between all the cases is that they rely on deep learning techniques to reach their goals. Those techniques are, for example, needed to map financial news to a trained set to decide the authenticity of the news [71]. In the medical sphere, machine learning is used to uncover similarities between patients and thus to facilitate the proper therapy prediction, or to enable the remote patient monitoring
by making sense of the medical records and pre-defined
rules [42], [63]. In [61] deep learning is used to analyze
different configurations of wind turbines, in order to decide
about their optimal location and design.
The technologies used on the way for the optimization
of existing processes differ from one another, as all of
those cases have their specific subtasks. Some of the cases
make use of the HDFS to deal with the volume of the
data [38], [48], [63], [71]. Those analyze some of the data
in batch mode to create a trained set, which is later used as
a basis for the real-time analysis and the decision-making.
Furthermore, in one of the case studies, for the purpose of
reducing the dimensionality of the data, the parallel file sys-
tem Lustre is used [61], allowing the simulation of different
turbine configurations. For the implementation of the crisis
and mapping system [41], the ElasticSearch database is used,
as this one enables the near real-time processing of the data.
To enable remote patient monitoring and sensemaking of all
the collected sensor data, in [42], the analytics engine Spark
is used to speed up the query performance.
2) DESCRIPTION OF CLUSTER 2
The second cluster comprises the three use cases no. 1 [36], 9 [49], and 13 [43]. Each of them focuses on value creation
through the analysis of IoT sensor data. The data itself origi-
nates in each case from various sources. In [36], for instance,
smart meters, different appliances, and smart home devices
are used. Diverse weather sensors that measure temperature,
wind speed, and humidity are sourced in [49]. In the case
of [43], the focus is on gas or imaging sensors that can typically be found in recycling systems. Apart from the behavior of the
occupants of smart buildings [36], or anomalies and failure
detection of the weather sensors [49], also statistics over
the recycled goods and their usage to improve the recycling
system are targeted [43].
In each of the cases, different kinds of sensors are used.
Big data technologies are needed in all of them to cope with the large amounts of unstructured data. In particular,
comprehensive analysis and clustering algorithms are applied
to uncover yet unknown patterns, as they were described
before. For this reason, special pre-processing measurements
like data cleaning, which removes outliers and irrelevant
data, and measurements to structure the data have to be
applied [36], [49]. To receive the desired insights, detecting
the previously hidden patterns, a real-time data processing
is not necessarily needed. For example, for the detection of
failures in the weather sensors in [49], the data is firstly
collected and bulk-loaded into the IoT framework. Afterward,
the data is analyzed to identify similar values and potential
failures. The analysis of the data in the recycling system
is realized after the data from the different sensors were
collected and pre-processed [43]. The main aim of all use
cases of this cluster is to uncover some relevant patterns out of
the huge amount of IoT data. To reach this target, the gathered
data needs to be classified and subsequently analyzed with the
help of suitable algorithms. In the case of [36], the K-means
algorithm is used to discover the hourly usage of appliances
and their usage on different weekdays by the occupants of a
building. This algorithm is also used to discover patterns in the values delivered from different weather sensors [49].
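As a minimal sketch of this kind of analysis, K-means can group days by their hourly usage profile; the smart-meter data below is hypothetical and does not stem from [36] or [49]:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    hourly_usage = rng.random((365, 24))  # one year of hourly readings for one appliance

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(hourly_usage)
    # Each centroid is a typical daily usage profile; the labels group
    # similar days (e.g. weekdays vs. weekends).
    print(km.cluster_centers_.shape, np.bincount(km.labels_))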
Another important part of the analysis process is the usage
of basic statistics. When the analysis is performed, the visu-
alization of the results is mandatory. For instance, this is used
to represent the hourly usage of different household devices
from the occupants of a building [36] or to show clusters,
built on the basis of the different weather sensor values [49].
As this group tries to realize the concept of IoT in different
areas of life, the sharing of information between users or
devices should be enabled. For example, the occupant of a
smart home should be able to exchange information with their household devices like a dishwasher or an oven [36]. Another
example of such information exchange can be observed in
the smart recycling system, where retailers and consumers
should be also able to communicate [43]. The various uncov-
ered patterns can be used, for instance, to generate energy
reduction recommendations, to avoid failures in weather sen-
sors, and to increase the efficiency of the current recycling
systems [36], [43], [49].
3) DESCRIPTION OF CLUSTER 3
Based on the set of given features and their occurrences in the
respective use cases, no. 17 [54], 19 [44], 21 [66], 22 [67],
25 [58] and 30 [59] were automatically assigned to the third
cluster. Again, a thorough investigation and comparison was
made to identify conspicuous similarities. Three of the cases
are aiming to realize smart city concepts [44], [59], [66]. The strong connection of those can be observed not solely in the low inter-cluster distance in Figure 3, but also regarding their overall scope. For example, in [66] the concept
of an itinerary planning platform for tourists, which suggests activities according to pre-defined criteria, such as location, time, and period preferences, is proposed. In the case of [44],
unstructured data is used to improve transportation. In doing
so, information about incidents on the highway shared by
tweets on Twitter, videos of a disaster, or pictures of a
traffic jam are utilized. A touristic recommender system is
shown in [59]. The personalized recommendations are not
only based on permanent data, such as information about
city infrastructure, existing restaurants, or hotels, but also on dynamic data that is constantly changing. The latter is not
only related to the velocity of the data but also its structure.
These data include, for instance, information coming from
wearable bracelets, social networks, and used sensor data
applied for the traffic and weather tracking [44], [59], [66].
Eventually, all of those cases consider data coming from
different sources. Hence, the data fusion plays a dominant
role in all of those cases.
In this context, geographic information systems (GIS) are
used to gather, store and analyze the whole geographic infor-
mation such as the location of users, traffic jams, incidents,
disasters, hotels, restaurants or attractions [44], [59], [66].
For the actual provisioning of personalized recommenda-
tions, real-time data analyses are required. In [44] those are
realized through the use of Spark streaming. Furthermore,
in each of those, sophisticated statistical methods are applied.
As the steps of data gathering, processing, and analysis are
conducted, the results need to be represented understandably
and appealingly for the user. Therefore, the usage of various
visualization techniques is crucial for this group. For instance, the personal touristic recommendations for activities, accommodations, and restaurants are shown in [59]. Another exam-
ple of visual techniques is the representation of the current
transportation situation for a selected region in video or image
format [44].
However, this cluster contains three further cases that have
not been considered yet [54], [58], [67]. One of them rep-
resents a system for the remote 24/7 patient monitoring,
which can be also used to determine the future medication
procedure [54]. The second one represents a framework that
utilizes the data originating from the financial sector and
various IoT devices to improve the user experience [58].
Although connections to the previously described cluster
were identified, distinctions in terms of the data processing
and the format were observed. In particular, real-time and
batch processing on differently structured data is performed.
The last case study makes use of high-performance comput-
ing (HPC) to analyze and optimize the operation of wind
turbines [67]. This one also deviates from the smart city
concept, represented in the first three cases. Despite the fact that those are serving here as outliers, all of them share
other characteristics, such as the processing speed, used data
formats, and the overall aim to improve existing methods.
Due to those similarities, they could serve as a cluster or
rather a group themselves.
4) DESCRIPTION OF CLUSTER 4
Cluster four contains the cases no. 4 [47], 14 [51], 16 [53],
23 [56], 33 [70], 34 [75] and 35 [60]. During the observation
and further examination of each of those, a similar distinc-
tion as in the previous cluster was observed. In particular,
two sub-groups were identified, since the cases do not share one sub-cluster but instead multiple characteristics.
The first sub-group comprises the cases originating
from [47], [56], [60]. They have in common that the main
focus is put on the realization of smart city concepts. The
first case in this group attempts to integrate information from
sensors and IoT devices used in a building, weather informa-
tion sensors as well as data from environmental sensors in
a cognitive building framework [47]. This framework shall
improve energy consumption by learning from the behavior of the inhabitants and adjusting the functionality of the devices according to the users' behavior [47]. The second case
introduces a smart traffic pilot, making use of the traffic light
data, weather and disaster information, as well as GPS data
about the positions of the vehicles. This information can be
used in different applications like route optimization or a
driving coach, suggesting fuel-saving driving patterns [60].
The last case in this group proposes a platform that integrates
data from IoT devices, GIS, and energy-related information
to improve the energy consumption and to reduce the CO2
emissions [56]. Each of those three concepts corresponds to
the general aim of realizing the smart city concept.
Notwithstanding that, all of them make use of data coming
from various devices like household appliances, environmen-
tal sensors, smart meters, or GPS devices [47], [56], [60]. This
heterogeneous data is mostly unstructured and can have dif-
ferent content formats, in the form of texts, images, or videos.
Furthermore, personally identifiable information about the habits of the buildings' occupants, GPS positions, and passengers in a car is used [47], [56], [60]. The personally identifiable
information together with the input about the city and build-
ings’ infrastructure make up the permanent data used in this
group. Moreover, the real-time processing or at least the near-
real-time processing of the data should be enabled by the
proposed smart city concept. For example, GPS data and
environmental information should be processed on the fly so
that a plausible route recommendation can be delivered [60].
Further, the near-real-time integration of data, coming from
IoT devices and various networks (e.g. electrical and heating
networks), is a requirement for the energy management sys-
tems, proposed in the last case in this group [56].
In addition to those aspects, it was found that all cases
use GISs to locate the user, particular vehicles, or relevant
objects. To represent the delivered smart city solution, the
cases deploy different visualization techniques. For example,
the analyzed driving behavior of the user and the derived fuel
economy recommendations can be represented in a mobile
application [60]. This solution was also used in [56] to visu-
alize the energy consumption data and the possible improve-
ments that can be conducted. To derive the above-mentioned
recommendations, various statistical methods are used.
Those methods are needed to calculate values such as the
energy consumption in particular rooms, average fuel con-
sumption on a road segment, or corresponding indicators for
certain time frames [47], [60].
The remaining cases are no. 14 [51], 16 [53], 33 [70] and
34 [75]. Those have different application areas and neither fit the smart city concept nor can they form a separate group. The first one presents a smart clinical workflow implementation that should automate some parts of the patient care [51]. This one could fit the smart city concept, but the developed solution makes use of neither visualization techniques nor a GIS and thus does not fit in the already built subgroup.
The second one [70] differs strongly regarding the aim to
integrate several bioinformatics databases and, thus, improve
the scalability of the cancer analytical system. The next case
study has a similar aim to the previous one. Here, the query
performance of a library information system needed to be
improved [53]. However, it shows a different pattern in the fulfilled feature set, including only structured data from a single
source. The last case study considers event-manufacturing
data and aims to improve existing processes [75]. This one
does not maintain personally identifiable information and does not use a GIS, both crucial requirements of the above-defined smart city subgroup.
5) DESCRIPTION OF CLUSTER 5
The next cluster, derived from the hierarchical clustering
results, consists of three cases, namely no. 15 [52], 28 [74]
and 29 [46]. The general aim of those cases can be described
as the integration of data from different sources, improving
the scalability and leading to better analysis results. For
example, the first case study proposes a system for real-time
traffic control [52]. In comparison to the existing traffic con-
trol systems, the proposed solution should be able to consider
more than one aspect by involving more data resources. The
second case study aims to improve the quality of the user
experience in the communications area by involving more
resources in the analysis. In the past, data mining methods
were used in the telecom area to figure out problems only
in an isolated way, for example fraud detection based on
call detail records. In order to consider different telecom-
munication aspects, nowadays various information, coming
from mobile networks, GPS devices and social media has to
be considered [74]. The target of the last case study in this
category is to turn regular factories into smart factories, where
resources and machines communicate and deliver smart prod-
ucts that are aware of their production history. To realize those
factories, the integration of data from different machines and
operators needs to be pursued [46].
The data used in this cluster can be both structured and unstructured. Examples of structured data are records
of the log and machine operating times [46], [74]. In con-
trast, social media data, camera pictures, and sensor data
are examples of unstructured data [46], [74]. Furthermore,
all three case studies maintain personally identifiable infor-
mation, which can require special processing techniques.
This information can be the location of a driver, smart card
data of public transport users, or the call history log of a
telecom company customer [52], [74]. Permanent as well as
transient data (data that is deleted by the end of a session) play
an important role in the performed analysis in this cluster.
Regarding the processing of the data, real-time analytics
are required. For instance, to enable real-time traffic con-
trol, incident detection has to be performed on the fly [52].
The same applies to the communication between prod-
ucts, resources, and machines to realize smart factory con-
cepts [46]. Besides that, in all of the cases, NoSQL databases
are used to enable the querying of the diverse information.
In the presented solutions, basic statistics are used to calculate
values such as average speed, travel time, subscriber churn
likelihood, and operational time of a machine [46], [52], [74].
Deep learning algorithms are used to reach the goals of the
analysis. Those enable, for example, the consideration of user feedback in the analysis of the user experience quality [74].
6) DESCRIPTION OF CLUSTER 6
The sixth cluster includes the seven cases no. 7 [64], 10 [37], 11 [65], 26 [73], 27 [45], 42 [40] and 43 [33]. When inspecting the cases, two different groups were identified.
The first one consists of the four use cases no. 7 [64],
27 [45], 42 [40], and 43 [33], aiming to deal with the growing
amount of medical data and to improve the analysis quality in
the healthcare area through the integration of additional data.
In the first use case of this group [64], a Hadoop ecosystem is proposed to deal with the volume, variety, and velocity of medical data coming from various applications and devices.
The use case no. 42 [40] deals with the analysis of human
genome data, which is continuously growing to the point that even HPC clusters cannot cope with the challenge of processing this amount of data. As a solution to the problem, a Hadoop system is
proposed. In [45] a framework that allows querying both
structured and unstructured medical data is introduced. The
main goal is to improve the decision basis for medical experts.
The last case [33] in this group presents a cloud-based
analytics solution that should turn the massive amounts of
medical data into value. In almost all cases, structured and
unstructured data, located in the medical area, are used. While
the first mainly contains structured documents like patient
records like the name, age, or previous diagnosis, the latter
comprises images, clinical notes, unstructured documents,
and genome data [33], [40], [45], [64]. Due to the handling of
personally identifiable information, which falls under special
regulations, sophisticated security measures and storage
solutions are required in the final system [33]. Again it was
noticed that data fusion is of major importance, because of
the use of data originating from different (healthcare) insti-
tutions, devices, and other sources. For example, in [64] a
system is proposed that is intended to improve the healthcare
situation in Algeria by an efficient distribution of medical
resources and staff. To achieve this, information from one
university hospital, five public hospitals, one medical school,
51 polyclinics and some laboratories needs to be integrated.
For the technical implementation, all of the use cases rely on
the HDFS. Furthermore, it was found that real-time processing was not needed in any of the use cases [40], [64]. For
the analysis itself, all use cases utilize basic statistics and data
mining.
Compared to the cases described before, all of the remaining ones, namely 10 [37], 11 [65], and 26 [73], follow a different aim. Here, the linkage of data from different sources
is of major interest. The first case [37] of those presents
a framework for the detection of insurance frauds. At the
moment, insurance fraud detection methods are solely being
used in separate fields like healthcare, financial services, and
others. To find a suitable solution and to achieve a broad cov-
erage, data from 34 different fields, including sources such
as customer information, contracts and insurance claims, are
integrated [37]. Case no. 11 [65] presents an architecture for
the processing of data coming from different social media
channels, and in [73] a data management system environment
that is supposed to deal with unstructured data from various
sources is introduced. All of those case studies require the
processing of unstructured data [37], [65], [73]. Most of
the considered data such as insurance contracts and claims
are in an unstructured textual form [37]. As the cases in
this subgroup propose frameworks and architectures for the
integration of the vast amount of data from different sources,
the cleaning of the original data constituted an important
step. This includes auxiliary activities, like outlier detection
or fixing missing values [37], [73]. Similar to the previous
group, real-time data processing is not required.
7) DESCRIPTION OF CLUSTER 7
The last cluster, resulting from the hierarchical clustering,
contains the ten use cases no. 12 [50], 18 [55], 20 [72],
24 [57], 31 [68], 32 [69], 36 [76], 38 [62], 40 [77] and
41 [39]. Similar to the sixth cluster, the use cases can be split into two separate groups regarding their reasons for using big data technologies.
The first group consists of the three cases [55], [62], [68]
that are focused on supporting informed decision making,
mainly in the healthcare area. The first case study attempts
to provide a basis for precision medicine, by facilitating the
data analysis of various molecular profiles [55]. The second
proposes a general framework that should enable the integra-
tion of healthcare data from various resources and thus allow
researchers to conduct innovative types of analysis [62]. The
last case in this group aims to improve the daily practices in a hospital by utilizing transactional data [68]. All of the
three cases make use of heterogeneous, unstructured data of
different content formats, such as text, images (e.g. diagnostic
tests), signals, and phenotypes [62]. Besides that, personally identifiable information, in the form of patient records, is used [55], [62], [68]. A further important feature shared by all of the case studies is the initial data pre-processing.
Here, it constitutes a crucial step as it increases the quality
of the information used for the decision-making process. It is
for example, used to deal with spelling and grammar mistakes or to perform anonymization [55], [62], [68]. The analysis used for the decision making is performed with the help of basic statistics and data mining techniques. Different association, classification, and clustering methods are incorporated in order to discover relevant patterns, as well as to identify inter-attribute correlations and important relations [62], [68].
The second group contains the remaining six documents,
namely the cases no. 12 [50], 20 [72], 24 [57], 32 [69],
36 [76] and 41 [39]. All of them share the aim to enable the real-time analysis of data arriving at high speed.
In the first of the case studies, a framework for the online
analysis of high-speed physiological data is proposed, which
should improve the neonatal intensive care [50]. The sec-
ond one uses real-time analysis to forecast the power output
of solar plants [72]. Within the third case study, a frame-
work for the analysis of patent information and its usage
for research and development is introduced [57]. The sub-
sequent case no. 32 [69] proposes an analytic platform for
smart transportation, which analyzes data from heteroge-
neous data sources such as sensors and cameras in real-time.
In [76] real-time analysis is used to process social media
data for a disaster management system. The last use case
uses social media data to improve the quality of the passengers' experience [39]. As one may note, all of the described use cases in this group focus on real-time data analysis, which presumably arises from the high velocity of the incoming data.
A further important feature of this group is the processing
of heterogeneous data that is coming from different devices or sources. For example, cameras and traffic sensors are used
as an input for the realization of a smart transportation system
in [69]. Apart from that, social media data, for example
provided by Twitter, Google+, or YouTube, can be harnessed
for disaster management or the improvement of the quality
of the customer’s experience [39], [76]. For case no. [72]
structured, semi-structured (e.g. weather forecast data) and
unstructured data (e.g. customer behavior, video files) are
used. Because of the flexibility and scalability, the HDFS is
incorperated as the foundation for each use case. The final
results are eventually visualized and presented for instance
by dashboards [72], bar charts [57], or a decision map [76].
D. DERIVED STANDARD USE CASES
From the first explanation and qualitative examination of the seven identified clusters, it was noted that some of the use cases did not match each other properly, even though they were assigned to one agglomerated cluster. Hence, modifications were required in different ways, such as insertion, deletion, and consolidation, to better highlight the data characteristics, used methods, and aims of the use cases. Above
all, this was required to ensure that not only key indicators,
such as the distance, are used for the creation of the stan-
dard use cases, but also qualitative assessments are realized.
Subsequently, in total, nine different clusters were derived
from the qualitative examination. Those are depicted
in Table 7.
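To make the quantitative part of this step more tangible, the following minimal Python sketch illustrates the agglomerative clustering of use cases encoded as a binary feature matrix. It is an illustrative re-implementation under assumptions, not the original computation (the reference list points to MATLAB's linkage function [84]); the feature columns and values are hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical input matrix: one row per use case, one column per
# template feature (e.g. unstructured data, real-time processing).
features = np.array([
    [1, 0, 1, 1, 0],  # use case A
    [1, 0, 1, 0, 0],  # use case B
    [0, 1, 0, 1, 1],  # use case C
    [0, 1, 0, 1, 0],  # use case D
])

# Agglomerative (bottom-up) clustering with average linkage on the
# Jaccard distance, which suits binary feature vectors.
Z = linkage(features, method="average", metric="jaccard")

# Cutting the dendrogram yields flat clusters, which are then
# refined qualitatively (merged, split, or reassigned).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster label per use case

The purely distance-based grouping produced this way forms the starting point; the qualitative modifications described in the following paragraphs are applied on top of it.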
In the first and second column, the number of the derived
cluster as well as the mapped use cases are stated. The
general aim, all relevant features, and the needed modifica-
tions are described in the remaining columns. For instance,
it was noted that the first identified groups of clusters three
and four, containing cases no. 19, 21, 30, and 4, 23, 35,
share similar interests. Besides the general focus on smart
cities, also the same characteristics are shared, except for
one case (no. 23) that uses near-real-time processing instead of real-time processing [56]. Hence, both of them were
merged into one new cluster (cf. Table 7 cluster no. 3).
All remaining cases of the initially calculated third clus-
ter, focusing on sensor analysis, became the new second
cluster.
Furthermore, the cases no. 16, 33, 34, and 40 appeared to be outliers, not only for the respective fourth cluster but in parts also for the entire dataset. This does not represent an error in the qualitative analysis but rather reflects how heterogeneous the individual use cases can be. For instance, in case no. 16 [53],
the main goal was to improve the query performance of
a library information system. Within the given collection,
this was the only case that solely handled structured data.
Case no. 33 [70] exclusively used graphical processing of
the data to efficiently handle queries on multiple integrated
bioinformatics databases. To prevent misleading information,
those use cases were removed or assigned to another cluster.
Case no. 14 [51] proposes a smart clinical workflow that aims
to increase the volume of medical data that can be processed.
In doing so, health data from different sources are integrated
and used to facilitate predictive therapy and to improve the
wellbeing of the patient. Due to the similarities to the first
cluster and the goal to generally improve the quality of the
performed analysis, this case was assigned to said cluster.
As already highlighted during the description of the orig-
inal third cluster, the cases no. 17 [54], 22 [67] and 25 [58]
revealed no real interconnection to the overall features pre-
sented in this cluster. By comparing those cases, it was found
out that all of them aim to optimize existing processes by
using big data. Eventually, a new cluster was manually built
(cf. Table 7 cluster no. 9). The separate groups, which were identified within the initially calculated sixth and seventh clusters, were extracted and declared as separate clusters. Hence, out of both clusters, two additional ones emerged (cf. Table 7 clusters no. 5-8).
Overall, for most of the initially formed clusters, only
minor modifications were required. Researchers, as well as
practitioners, can utilize these to obtain an idea not only
about the general meaningfulness of their own project but
also possible implementation details from specific use case
descriptions. This is especially the case if a similar approach
is pursued. For increased understandability, within the following paragraphs, each of the identified standard use cases is briefly described, comprising the common features in a narrative way.
TABLE 7. Standard Use Cases (UC) derived from the clusters.
1) STANDARD USE CASE 1 – DATA ANALYSIS
IMPROVEMENT
By adopting big data technologies, an improvement in the
quality of the data analysis is pursued. A significant step
to achieve this aim is to make sense of massive amounts of unstructured data arriving at high speed, as well as the exploitation of sophisticated methods, such as deep learning. In addition, statistics and classification methods
are often used to increase the quality of the analysis. The
described characteristics of this general case and the used
methods can be mapped to different cases, coming from
healthcare, transportation, manufacturing areas, and social
media. Details of the particular cases can be viewed in [38],
[41], [42], [48], [61], [63], [71], [85].
2) STANDARD USE CASE 2 – BATCH MODE SENSOR DATA
ANALYSIS
One of the reasons for harnessing big data technologies is
to enable the processing of large amounts of (IoT) sensor
data to obtain new insights. Key factors in this use case
are the integration of different data sources, such as sen-
sors and devices, as well as enabling the data exchange
between users and applications. The data commonly does
not exist in a structured format, thus processing unstructured
data plays an important role. Real-time processing is not
required, as the data is firstly gathered and then processed in
batch mode. To uncover different types of patterns, clustering
approaches are used for the analysis. The visualization of the
processed data is crucial to represent the findings. Based on
those, strategies, for instance, to improve the user experience,
resource allocation, process costs, and others, can be devel-
oped. Concrete specifications for this standard use case are
explained in [36], [43], [49].
3) STANDARD USE CASE 3 – SMART CITY
This category deals with the challenges of smart cities by
involving various resources in real-time data analysis. The
concept itself utilizes data from various devices, sensors,
and human actors to improve the quality of life for citizens.
For this purpose, structured, unstructured as well as transient
and permanent data can be used as analysis input. In order
to turn a large amount of heterogeneous data into value,
deep learning algorithms are used. In this case, a robust
storage solution for massive amounts of differently structured
data should be used, such as a NoSQL database. Due to the nature of this domain, personal information has to be recognized and privacy-preserving techniques have to be applied.
All related cases are comprehensively described in [44],
[47], [56], [59], [60], [66].
4) STANDARD USE CASE 4 – MULTI-LEVEL PROBLEMS
In this standard use case, sophisticated multi-level problems
are stated, which require thorough planning from different
perspectives, covering not only the needed system but also the
data being processed. Organizations facing those problems
are confronted, in particular, with the growing amount of data
coming from various institutions, such as in the healthcare
sector. Apart from the required high reliability of the targeted
solution and the needed ability to efficiently search, query
and store the data, also privacy-preserving techniques have to
be considered. Moreover, processing unstructured data, such
as handwritten documents or images, needs to be enabled.
For the analysis of the data, different data mining approaches,
which analyze stored data (e.g. on an HDFS) in batch mode,
can be considered. This standard use case originates from the
following contributions [46], [52], [74].
5) STANDARD USE CASE 5 – EXPAND DATA SOURCING
In this case, data coming from various resources needs to
be combined into one functioning system. As the consid-
ered data originates from different sources or instances, not
only the structure but also the data itself can be highly
volatile. Due to this reason, not only sophisticated stor-
age solutions for those various types of data (e.g. NoSQL),
but also sophisticated pre-processing techniques are needed.
After the initial collection and cleaning, various statistical
methods can be used. The data is usually processed in batch-
mode. Concrete details of all relevant use cases can be found
in [33], [40], [45], [64].
6) STANDARD USE CASE 6 – DATA CONNECTION
Adopting big data technologies in areas with widespread
collections of information can improve decision-making
by incorporating a larger information basis. As wrong
decisions, especially in domains like healthcare, can have
enormous consequences, guaranteeing the correctness of the
analyzed data is a significant step, necessitating extensive
pre-processing. Depending on the application area, this can
additionally require special processing steps like anonymiza-
tion or classification. For the analysis, data mining techniques
can be used and also efficient querying and searching over the
data in real-time should be enabled. Further information is provided in [37], [65], [73].
7) STANDARD USE CASE 7 – DECISION SUPPORT
Real-time analytics on differently structured data are used in
those use cases, to facilitate decision support for data-driven
problems. Through basic statistics, classifications and other
analytical methods, previously unused data are converted into
valuable information. For a better presentation of the obtained
results, visualization techniques are highly important. This
use case can be characterized by the phrase ''turn volume into value''. Details can be observed in [55], [62], [68].
8) STANDARD USE CASE 8 – HIGH-SPEED ANALYSIS
Within this use case, the input data comes in a structured
and unstructured format and needs to be processed in (near-)
real-time, to ensure that all functionalities and results can
be immediately provided. In addition, to maintain, search,
query, index and analyze all data, complex solutions are
required. For a comprehensible representation of the results
and the performed calculations, visualization techniques are
paramount. For particular insights, the following contribu-
tions can be used [39], [50], [57], [69], [72], [76].
9) STANDARD USE CASE 9 – PROCESS OPTIMIZATION
Big data technologies turned out to be an enabler for the
general optimization of existing processes. Usually, the data arrives at high velocity and needs to be processed in real-time. However, a batch-processing mode should also be
available either as a backup solution or for specific analytical
tasks. In this case, both, structured and unstructured data
are considered. Clustering techniques support the identifica-
tion of recommendations with which existing processes can be optimized. Various visualization techniques allow for presenting the results in an appealing way. Further details are
described in [54], [58], [67].
IV. EVALUATION
In order to check the validity of the artifact and, thus, the pro-
posed standard use cases, a thorough evaluation is required
in which multiple aspects are verified [27], [28]. On the one hand, it is necessary to assure a sufficient coverage of the regarded application domains, methods, and data characteristics; on the other hand, the undertaken trade-off between possible degrees of fragmentation (cluster building) has to be examined.
entation, an approach that is inspired by machine learning’s
division into training and test data is utilized. For this purpose,
the steps of the literature review are replicated while applying
the same criteria (cf. Table 1), but this time only for selected
case studies published in 2019.
Apart from the search procedure, also the comprehen-
siveness check through the use of the altered template was
performed. In total, three additional use cases were found and used for the evaluation of the obtained results. Since those
cases were not involved in the creation of the standard use
cases, they function as the equivalent of a test data set. The
first case study used for the evaluation comes from the area
of online retail [86]. It provides an approach for a recom-
mendation system that can be realized in an online store,
requiring a user to sign up. Besides harnessing historical
and transactional data, it also makes use of the customer’s
browsing history. Taking a look at Table 7 and considering
that the case study involves both structured as well as unstruc-
tured data and requires real-time processing, regarding those
FIGURE 4. The DSR grid according to vom Brocke and Maedche [89].
features, this one can be placed in the fourth, eighth, and ninth
cluster. However, the recommender engine used in this case study has a key role in the data analysis, which narrows the categorization possibilities down to the ninth cluster.
In conclusion, the analyzed case study aims to improve online retail by introducing personal recommendations based on real-time processed browsing data, fitting the ninth general
use case that has the target to optimize existing processes
with the deployment of big data technologies. The second case study presents a system for the incorporation of real-time social media data into the analysis in the area of tourism [87].
The analysis of the data comprises the following main steps
– gathering of the data, cleaning and storage, querying and
filtering as well as the visualization of the results. The data
is collected from different social media sources in this case
– Instagram, Flickr, Foursquare and Twitter, resulting in an
unstructured content format that can manifest in the form
of posts, reviews, images or videos. Utilizing Table 7 and
considering the case study’s aim to involve real-time social
media data in the analysis, this one can be placed in the eighth
cluster, which targets real-time analysis of data, incoming
with high-speed. The last case study, harnessed to evaluate the
identified use cases, originates from the area of smart trans-
portation [88]. Compared to the already existing approaches,
which deal with single issues like congestion avoidance or
environmental-friendly driving, the considered case study
shows a system that proposes a solution to multiple problems.
It aims to track vehicles, suggest optimal routes and realize a
smart parking concept, utilizing predominantly unstructured
data from various sources like sensors, cars or navigation
systems. With regard to Table 7, this example fits into the
third general use case.
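The matching performed in these three examples can be thought of as comparing a case's template features with the features shared within each standard use case. The following minimal Python sketch illustrates this idea; the feature encoding and cluster profiles are hypothetical assumptions for illustration, not values taken from Table 7.

import numpy as np

def jaccard_similarity(a, b):
    # Jaccard similarity of two binary feature vectors.
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union if union else 0.0

# Hypothetical feature profiles of three standard use cases
# (1 = feature shared by the cluster's cases).
profiles = {
    "UC3 smart city": np.array([1, 1, 1, 0, 1]),
    "UC8 high-speed analysis": np.array([1, 1, 0, 1, 0]),
    "UC9 process optimization": np.array([1, 1, 0, 1, 1]),
}

new_case = np.array([1, 1, 0, 1, 1])  # e.g. an encoded evaluation case

best = max(profiles, key=lambda k: jaccard_similarity(profiles[k], new_case))
print(best)  # standard use case sharing the most features

As in the evaluation above, such a purely quantitative match is complemented by qualitative reasoning, for instance the decisive role of the recommender engine in the first evaluation case.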
In conclusion, the successful categorization of the three
evaluation case studies in one of the defined general use cases
suggests that adequate coverage was achieved. The degree of
fragmentation, in turn, is based on the intended application
scenario. While a more general approach might increase the
coverage even further, it offers no clear orientation in the
selection of potentially similar case studies. Vice versa, every
case as its own category would effectively negate the idea of a
categorization. For that reason, the current number constitutes
a trade-off that allows for a choice of relevant properties,
while still providing several example cases as a knowledge
base for the aspired endeavor.
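This trade-off corresponds directly to where the dendrogram of the hierarchical clustering is cut. The following sketch (random, hypothetical data; Python with SciPy assumed) shows how the chosen cut distance controls the degree of fragmentation:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(43, 30))  # 43 cases, 30 binary features

Z = linkage(features, method="average", metric="jaccard")

# A low threshold fragments the cases into many small clusters,
# while a high threshold collapses them into a few coarse ones.
for t in (0.3, 0.5, 0.7, 0.9):
    n = len(set(fcluster(Z, t=t, criterion="distance")))
    print(f"cut at {t}: {n} clusters")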
V. CONCLUSION
In recent years, big data has been one of the most promi-
nent topics in the IT-sector. However, there is still a lot of
unawareness and uncertainty when it comes to the execution
of such projects, especially right at the beginning of the
planning phase. Hence, in the contribution at hand, an in-
depth investigation of successfully conducted projects was
performed, to provide future practitioners as well as other
researchers, inter alia, with decision support concerning the
realization of their potential big data projects. As a result
of a literature review, 43 cases published between 2015 and
2018 were identified. Those cover detailed information about
the presented big data projects. To achieve a categorization
for the obtained results, all use case descriptions were thor-
oughly examined using a textual analysis technique. At this stage, hierarchical clustering proved to be a promising solution, revealing various clusters with a similar feature set.
Based on the gathered information and further modification,
a total of nine distinct clusters were identified.
TABLE 8. Amended and adapted use case template.
Subsequently, those standard use cases constitute the arti-
fact of the conducted DSR endeavor as well as the answer to
the research question. To summarize and highlight the main
pillars, implications and key aspects of this research, in the
following, the corresponding DSR grid according to vom
Brocke and Maedche [89] is depicted in Figure 4. One part
of the contribution is constituted by the collection, structur-
ing and presentation of comprehensively described use cases
published in recent years in the academic area. Additionally,
a template was used and modified for the analysis of the
identified cases. Through the use of this template, the gen-
eral comprehensiveness of one’s endeavor can be validated
and possible shortcomings or unrecognized gaps identified.
Beyond that, a presentation of standard use cases, derived
from the investigated publications, is made that can serve as
an orientation and initial starting point for the realization of
related projects. Consequently, researchers as well as practi-
tioners may greatly benefit from the results discussed in this
work.
A. LIMITATIONS AND FUTURE RESEARCH
Although a suitable answer to the initially formulated RQ
was achieved, certain aspects have to be mentioned, which
may call for future optimization or new research directions.
This refers not only to the results as such but also to the
methods used to achieve them. For instance, this includes the
recognition of additional weightings during the analysis of
the input matrix, since sometimes a particular feature appears
to be more important than another one. Examples of this are features that are directly related to the data characteristics or to the methods used to analyze them.
During the qualitative analysis of the use cases it was
noticed that many of the project descriptions also contained
concrete specifications. However, most of the decisions are
tailor-made. Hence, an additional investigation of them may,
in turn, result in an even more complicated analysis of the
data. For now, implementation details can be viewed in each
of the referred cases within the standard use cases. Addi-
tionally, the chosen algorithm represents only one suitable
way for the creation of the found out clusters. Apart from
the discussion of the available algorithms and their poten-
tial usability, also other algorithms were alternatively tested,
especially with a view on future enlargements of the dataset,
for which the manual processing of each case would require
too much effort. Hence, for the creation of the input matrix,
a computer-supported solution was tested that automatically
processes the data and identifies important phrases on the basis of term frequency. Despite a thorough pre-processing procedure, which focused on the removal of unnecessary stop words, endings, and inconsistent descriptions, no promising results were found. Even after an additional filtering step applied to the identified phrases, the clusters had too many dissimilarities. This was not only assessed on the qualitative level but also evidenced by multiple irrelevant phrases, such
as ‘‘the authors’’ or ‘‘it has’’. Nevertheless, as already shown
before, this has not achieved the desired effect.
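For illustration, such a term-frequency extraction could look like the following sketch (Python with scikit-learn assumed; the sample descriptions are hypothetical stand-ins for the actual use case texts):

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "real-time analysis of sensor data in a smart city",
    "batch processing of unstructured medical records",
]

# Uni- and bigrams with English stop words removed; word endings
# could be normalized further with a stemmer or lemmatizer.
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
tf = vectorizer.fit_transform(documents)

# The most frequent phrases per document are candidates for the
# input matrix features.
terms = vectorizer.get_feature_names_out()
for row in tf.toarray():
    print(sorted(zip(terms, row), key=lambda t: -t[1])[:5])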
Another tested method was the use of natural language
processing to uncover existing topics in a collection of unpro-
cessed textual documents [90]. In particular, the topic mod-
eling approach LDA was examined, which is a probabilistic
model that considers each topic as a combination of keywords
and each document as a combination of multiple topics.
Even though comprehensive pre-processing steps, such as
lemmatization, stop words and punctuation removal, were
repeatedly performed [90], no satisfying results were found.
Frequently, it was noticed that mainly buzzwords, especially those used in the introduction and conclusion of the papers, were identified as topics.
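A minimal sketch of such an LDA pipeline (Python with gensim assumed; the tokenized toy documents stand in for the actual pre-processed use case descriptions):

from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["sensor", "data", "smart", "city", "real", "time"],
    ["medical", "records", "batch", "processing", "hadoop"],
    ["social", "media", "stream", "analysis", "twitter"],
]

# Bag-of-words corpus after stop-word and punctuation removal and
# lemmatization (already applied to the toy tokens above).
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Each topic is a distribution over keywords, and each document is
# modeled as a mixture of topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)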
The evaluation of the coverage of the formulated cases had a positive result; however, the sample size was rather small, and especially future big data projects might potentially necessitate adjustments. Consequently, the sample size
of the found big data projects could be enlarged. At this
point, an extension of the actual dataset could be realized
through the investigation of additional years, further literature
databases, and also by interviewing larger companies that conduct big data-related projects. Beyond that, also a long-
term evaluation is planned, which shall be realized through
the application at the very beginning of a project. Here, not
only the general meaningfulness but also the possibility to
derive concrete information and implications from the indi-
vidual use cases could be tested.
By referring to this, an implementation of the derived stan-
dard use cases within a concrete decision support system for
big data projects, for instance conceptually described in [23],
appears to be a promising direction for future research.
APPENDIX
See Table 8.
REFERENCES
[1] C. Dobre and F. Xhafa, ‘‘Intelligent services for big data science,’’ Future
Gener. Comput. Syst., vol. 37, pp. 267–281, Jul. 2014, doi: 10.1016/j.
future.2013.07.014.
[2] S. Yin and O. Kaynak, ‘‘Big data for modern industry: Challenges
and trends [point of view],’Proc. IEEE, vol. 103, no. 2, pp. 143–146,
Feb. 2015, doi: 10.1109/JPROC.2015.2388958.
[3] X. Jin, B. W. Wah, X. Cheng, and Y. Wang, ‘‘Significance and challenges
of big data research,’Big Data Res., vol. 2, no. 2, pp. 59–64, Jun. 2015,
doi: 10.1016/j.bdr.2015.01.006.
[4] L. Zhu, F. R. Yu, Y. Wang, B. Ning, and T. Tang, ‘‘Big data analytics in
intelligent transportation systems: A survey,’’ IEEE Trans. Intell. Transp.
Syst., vol. 20, no. 1, pp. 383–398, Jan. 2019, doi: 10.1109/TITS.2018.
2815678.
[5] W. L. Chang and N. Grady, NIST Big Data Interoperability Framework—
Definitions. Gaithersburg, MD, USA: NIST, 2019. Accessed: Jul. 14, 2020,
doi: 10.6028/NIST.SP.1500-1r2
[6] A. Gandomi and M. Haider, ‘‘Beyond the hype: Big data concepts, meth-
ods, and analytics,’Int. J. Inf. Manage., vol. 35, no. 2, pp. 137–144,
Apr. 2015, doi: 10.1016/j.ijinfomgt.2014.10.007.
[7] S. Kaisler, F. Armour, J. A. Espinosa, and W. Money, ‘‘Big data: Issues
and challenges moving forward,’’ in Proc. 46th Hawaii Int. Conf. Syst. Sci.,
Jan. 2013, pp. 995–1004.
[8] D. Izadi, J. Abawajy, S. Ghanavati, and T. Herawan, ‘‘A data fusion method
in wireless sensor networks,’Sensors, vol. 15, no. 2, pp. 2964–2979,
Jan. 2015, doi: 10.3390/s150202964.
[9] H. Lee, N. Aydin, Y. Choi, S. Lekhavat, and Z. Irani, ‘‘A decision sup-
port system for vessel speed decision in maritime logistics using weather
archive big data,’’ Comput. Oper. Res., vol. 98, pp. 330–342, Oct. 2018,
doi: 10.1016/j.cor.2017.06.005.
[10] A. P. Plageras, K. Psannis, C. Stergiou, H. Wang, and B. B. Gupta. (2018).
Efficient IoT-Based Sensor BIG Data Collection-Processing and Analysis
in Smart Buildings. [Online]. Available: https://www.semanticscholar.
org/paper/Efficient-IoT-based-sensor-BIG-Data-and-analysis-in-
Plageras-Psannis/fb18e87bdfa27b3bc7a9d9337f02cd6b66d0c372
[11] K. E. Psannis, C. Stergiou, and B. B. Gupta, ‘‘Advanced media-based smart
big data on intelligent cloud systems,’IEEE Trans. Sustain. Comput.,
vol. 4, no. 1, pp. 77–87, Jan. 2019, doi: 10.1109/TSUSC.2018.2817043.
[12] Y. Wang, L. Kung, W. Y. C. Wang, and C. Cegielski, ‘‘Developing a big
data-enabled transformation model in healthcare: A practice based view,’’
in Proc. 25th Int. Conf. Inf. Syst., 2014, pp. 1–12.
[13] P. Aversa, L. Cabantous, and S. Haefliger, ‘‘When decision support sys-
tems fail: Insights for strategic information systems from formula 1,’
J. Strategic Inf.Syst., vol. 27, no. 3, pp. 221–236, Sep. 2018, doi: 10.1016/j.
jsis.2018.03.002.
[14] R. Häusler, D. Staegemann, M. Volk, S. Bosse, C. Bekel, and K. Turowski,
‘‘Generating content-compliant training data in big data education,’’ in
Proc. 12th Int. Conf. Comput. Supported Edu., 2020, pp. 104–110.
[15] T. Nguyen, L. Zhou, V. Spiegler, P. Ieromonachou, and Y. Lin, ‘‘Big
data analytics in supply chain management: A state-of-the-art literature
review,’Comput. Oper. Res., vol. 98, pp. 254–264, Oct. 2018, doi: 10.
1016/j.cor.2017.07.004.
[16] O. Müller, M. Fay, and J. vom Brocke, ‘‘The effect of big data and analytics
on firm performance: An econometric analysis considering industry char-
acteristics,’J. Manage. Inf. Syst., vol. 35, no. 2, pp. 488–509, Apr. 2018,
doi: 10.1080/07421222.2018.1451955.
[17] S. F. Wamba, S. Akter, A. Edwards, G. Chopin, and D. Gnanzou, ‘‘How
‘big data’ can make big impact: Findings from a systematic review and
a longitudinal case study,’’ Int. J. Prod. Econ., vol. 165, pp. 234–246,
Jul. 2015, doi: 10.1016/j.ijpe.2014.12.031.
[18] Z. A. Al-Sai, R. Abdullah, and M. H. Husin, ‘‘Critical success fac-
tors for big data: A systematic literature review,’IEEE Access, vol. 8,
pp. 118940–118956, 2020, doi: 10.1109/ACCESS.2020.3005461.
[19] D. Staegemann, M. Volk, A. Nahhas, M. Abdallah, and K. Turowski,
‘‘Exploring the specificities and challenges of testing big data systems,’’ in
Proc. 15th Int. Conf. Signal-Image Technol. Internet-Based Syst. (SITIS),
Nov. 2019, pp. 289–295.
[20] D. Staegemann, M. Volk, N. Jamous, and K. Turowski, ‘‘Understanding
issues in big data applications—A multidimensional endeavor,’’ in Proc.
25th Amer. Conf. Inf. Syst., 2019, pp. 1–10.
[21] S. Sagiroglu and D. Sinanc, ‘‘Big data: A review,’’ in Proc. Int. Conf.
Collaboration Technol. Syst. (CTS), May 2013, pp. 42–47.
[22] S. Bonesso, E. Bruni, and F. Gerli, ‘‘How big data creates new job
opportunities: Skill profiles of emerging professional roles,’’ in Behavioral
Competencies of Digital Professionals: Understanding the Role of Emo-
tional Intelligence, S. Bonesso, E. Bruni, F.Gerli, Eds. Cham, Switzerland:
Palgrave Macmillan, 2020, pp. 21–39.
[23] M. Volk, D. Staegemann, M. Pohl, and K. Turowski, ‘‘Challenging big data
engineering: Positioning of current and future development,’’ in Proc. 4th
Int. Conf. Internet Things, Big Data Secur., 2019, pp. 351–358.
[24] M. Volk, D. Staegemann, S. Bosse, A. Nahhas, and K. Turowski, ‘‘Towards
a decision support system for big data projects,’’ in WI2020 Zentrale
Tracks, N. Gronau, M. Heine, K. Poustcchi, and H. Krasnova, Eds. Berlin,
Germany: GITO Verlag, 2019, pp. 357–368.
[25] R. Dontha. The Origins of Big Data—KDnuggets. Accessed: Jun. 16, 2020.
[Online]. Available: https://www.kdnuggets.com/2017/02/origins-big-
data.html
[26] O. Ylijoki and J. Porras, ‘‘Conceptualizing big data: Analysis of case
studies,’Intell. Syst. Accounting, Finance Manage., vol. 23, no. 4,
pp. 295–310, Oct. 2016, doi: 10.1002/isaf.1393.
[27] R. H. Von Alan, S. T. March, J. Park, and S. Ram, ‘‘Design science in
information systems research,’MIS Quart., vol. 28, no. 1, pp. 75–105,
2004.
[28] K. Peffers, T. Tuunanen, M. A. Rothenberger, and S. Chatterjee, ‘‘A design
science research methodology for information systems research,’J. Man-
age. Inf. Syst., vol. 24, no. 3, pp. 45–77, Dec. 2007.
[29] J. Webster and R. T. Watson, ‘‘Analyzing the Past to Prepare for the Future:
Writing a Literature Review,’MIS Quart., vol. 26, no. 2, pp. 13–23, 2002.
[Online]. Available: http://www.jstor.org/stable/4132319
[30] Y. Levy and T. J. Ellis, ‘‘A systems approach to conduct an effective
literature review in support of information systems research,’’ Informing
Sci., Int. J. Emerg. Transdiscipline, vol. 9, pp. 181–212, Jan. 2006, doi:
10.28945/479.
[31] R. Bauer. 3 Reasons Why You Need Business Case Studies—PAN Com-
munications. Accessed: Mar. 18, 2020. [Online]. Available: https://www.
pancommunications.com/blog/3-reasons-why-you-need-business-case-
studies/
[32] R. K. Yin, Case Study Research and Applications: Design and Methods.
Los Angeles, CA, USA: SAGE, 2018.
[33] F. Khalid. (2017). Inovalation is Driving Healthcare Transformation With
Pre-Engineered Infrastructure and Big Data Analytics. Dell EMC, Bowie,
MD, USA. Accessed: Jun. 1, 2020. [Online]. Available: https://www.emc.
com/collateral/customer-profiles/inovalon-vscale-case-study.pdf
[34] W. L. Chang and G. Fox. (2018). NIST Big Data Interoperability
Framework—Use Cases and General Requirements. Gaithersburg, MD,
USA. Accessed: Jul. 14, 2020. [Online]. Available: https://nvlpubs.nist.
gov/nistpubs/SpecialPublications/NIST.SP.1500-3r1.pdf
[35] J. Vom Brocke, A. Simons, B. Niehaves, K. Reimer, R. Plattfaut, and
A. Cleven, ‘‘Reconstructing the giant: On the importance of rigour in
documenting the literature search process,’’ in Proc. ECIS, Verona, Italy,
2009, pp. 1–14.
[36] A. Yassine, S. Singh, M. S. Hossain, and G. Muhammad, ‘‘IoT big data
analytics for smart homes with fog and cloud computing,’Future Gener.
Comput. Syst., vol. 91, pp. 563–573, Feb. 2019, doi: 10.1016/j.future.
2018.08.040.
[37] D. Kenyon and J. H. P. Eloff, ‘‘Big data science for predicting insur-
ance claims fraud,’’ in Proc. Inf. Secur. South Afr. (ISSA), Aug. 2017,
pp. 40–47.
[38] Y. Zhang, M. Zhang, T. Wo, X. Lin, R. Yang, and J. Xu, ‘‘A scal-
able lnternet-of-Vehicles service over joint clouds,’’ in Proc. IEEE Symp.
Service-Oriented Syst. Eng. (SOSE), Mar. 2018, pp. 210–215.
[39] H.-M. Chen, R. Schütz, R. Kazman, and F. Matthes, ‘‘How Lufthansa cap-
italized on big data for business model renovation,’’ MIS Quart. Executive,
vol. 16, no. 1, p. 4, 2017.
[40] M. Schlesner and F. Schinkel. (2016). Big Data Use Case: Genomic Data
Research. Fujitsu, Munich, Germany. Accessed: Sep. 4, 2019. [Online].
Available: https://www.datameer.com/wp-content/uploads/pdf/misc/cs-
PF4H-Genome-Research.pdf
[41] M. Avvenuti, S. Cresci, F. Del Vigna, T. Fagni, and M. Tesconi, ‘‘CrisMap:
A big data crisis mapping system based on damage detection and geop-
arsing,’Inf. Syst. Frontiers, vol. 20, no. 5, pp. 993–1011, Oct. 2018, doi:
10.1007/s10796-018-9833-z.
[42] M. K. Hassan, A. I. El Desouky, S. M. Elghamrawy, and A. M. Sarhan,
‘‘Intelligent hybrid remote patient-monitoring model with cloud-based
framework for knowledge discovery,’’ Comput. Electr. Eng., vol. 70,
pp. 1034–1048, Aug. 2018, doi: 10.1016/j.compeleceng.2018.02.032.
[43] F. Gu, B. Ma, J. Guo, P. A. Summers, and P. Hall, ‘‘Internet of Things and
big data as potential solutions to the problems in waste electrical and elec-
tronic equipment management: An exploratory study,’’ Waste Manage.,
vol. 68, pp. 434–448, Oct. 2017, doi: 10.1016/j.wasman.2017.07.037.
[44] Y. Arfat, M. Aqib, R. Mehmood, A. Albeshri, I. Katib, N. Albogami, and
A. Alzahrani, ‘‘Enabling smarter societies through mobile big data fogs
and clouds,’Procedia Comput. Sci., vol. 109, pp. 1128–1133, Jan. 2017,
doi: 10.1016/j.procs.2017.05.439.
[45] S. Istephan and M.-R. Siadat, ‘‘Unstructured medical image query
using big data—An epilepsy case study,’’ J. Biomed. Informat., vol. 59,
pp. 218–226, Feb. 2016, doi: 10.1016/j.jbi.2015.12.005.
[46] D. Mourtzis, E. Vlachou, and N. Milas, ‘‘Industrial big data as a result
of IoT adoption in manufacturing,’Procedia CIRP, vol. 55, pp. 290–295,
Jan. 2016, doi: 10.1016/j.procir.2016.07.038.
[47] S. Rinaldi, A. Flammini, M. Pasetti, L. C. Tagliabue, A. C. Ciribini, and
S. Zanoni, ‘‘Metrological issues in the integration of heterogeneous lot
devices for energy efficiency in cognitive buildings,’’ in Proc. IEEE Int.
Instrum. Meas. Technol. Conf. (I2MTC), May 2018, pp. 1–6.
[48] P. Ta-Shma, A. Akbar, G. Gerson-Golan, G. Hadash, F. Carrez, and
K. Moessner, ‘‘An ingestion and analytics architecture for IoT applied to
smart city use cases,’IEEE Internet Things J., vol. 5, no. 2, pp. 765–774,
Apr. 2018, doi: 10.1109/JIOT.2017.2722378.
[49] A. C. Onal, O. Berat Sezer, M. Ozbayoglu, and E. Dogdu, ‘‘Weather data
analysis and sensor fault detection using an extended IoT framework with
semantics, big data, and machine learning,’’ in Proc. IEEE Int. Conf. Big
Data (Big Data), Dec. 2017, pp. 2037–2046.
[50] S. Balaji, M. Patil, and C. McGregor, ‘‘A cloud based big data based online
health analytics for rural NICUs and PICUs in india: Opportunities and
challenges,’’ in Proc. IEEE 30th Int. Symp. Comput.-Based Med. Syst.
(CBMS), Jun. 2017, pp. 385–390.
[51] L. Carnevale, A. Celesti, M. Fazio, P. Bramanti, and M. Villari, ‘‘How to
enable clinical workflows to integrate big healthcare data,’’ in Proc. IEEE
Symp. Comput. Commun. (ISCC), Jul. 2017, pp. 857–862.
[52] S. Amini, I. Gerostathopoulos, and C. Prehofer, ‘‘Big data analytics archi-
tecture for real-time traffic control,’’ in Proc. 5th IEEE Int. Conf. Models
Technol. Intell. Transp. Syst. (MT-ITS), Jun. 2017, pp. 710–715.
[53] Hermansyah, Y. Ruldeviyani, and R. F. Aji, ‘‘Enhancing query perfor-
mance of library information systems using NoSQL DBMS: Case study
on library information systems of universitas indonesia,’’ in Proc. Int.
Workshop Big Data Inf. Secur. (IWBIS), Oct. 2016, pp. 41–46.
[54] I. Azimi, A. Anzanpour, A. M. Rahmani, P. Liljeberg, and T. Salakoski,
‘‘Medical warning system based on Internet of Things using fog com-
puting,’’ in Proc. Int. Workshop Big Data Inf. Secur. (IWBIS), Oct. 2016,
pp. 19–24.
[55] P.-Y. Wu, C.-W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman,
and M. D. Wang, ‘‘-Omic and electronic health record big data analytics
for precision medicine,’IEEE Trans. Bio-Med. Eng., vol. 64, no. 2,
pp. 263–273, Feb. 2017, doi: 10.1109/TBME.2016.2573285.
[56] E. Patti and A. Acquaviva, ‘‘IoT platform for smart cities: Requirements
and implementation case studies,’’ in Proc. IEEE 2nd Int. Forum Res.
Technol. Soc. Ind. Leveraging Better Tomorrow (RTSI), Sep. 2016, pp. 1–6.
[57] W. Seo, N. Kim, and S. Choi, ‘‘Bigdata framework for analyzing patents to
support strategic R&D planning,’’ Auckland, New Zealand: IEEE, 2016, pp. 746–753.
[Online]. Available: https://ieeexplore.ieee.org/document/7588929
[58] V. Dineshreddy and G. R. Gangadharan, ‘‘Towards an ‘Internet of Things’ framework for financial services sector,’’ in Proc. 3rd Int. Conf. Recent Adv. Inf. Technol. (RAIT), Dhanbad, India, Mar. 2016, pp. 177–181.
[59] Y. Sun, H. Song, A. J. Jara, and R. Bie, ‘‘Internet of Things and big data
analytics for smart and connected communities,’IEEE Access, vol. 4,
pp. 766–773, 2016, doi: 10.1109/ACCESS.2016.2529723.
[60] S. Pirttikangas, E. Gilman, X. Su, T. Leppanen, A. Keskinarkaus,
M. Rautiainen, M. Pyykkonen, and J. Riekki, ‘‘Experiences with smart city
traffic pilot,’’ in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2016,
pp. 1346–1352.
[61] A. Aguilera, R. Grunzke, U. Markwardt, D. Habich, D. Schollbach, and
J. Garcke, ‘‘Towards an industry data gateway: An integrated platform
for the analysis of wind turbine data,’’ in Proc. 7th Int. Workshop Sci.
Gateways, Jun. 2015, pp. 62–66.
[62] A. Abusharekh, S. A. Stewart, N. Hashemian, and S. S. R. Abidi,
‘‘H-DRIVE: A big health data analytics platform for evidence-informed
decision making,’’ in Proc. IEEE Int. Congr. Big Data, Jun. 2015,
pp. 416–423.
[63] M. Panahiazar, V. Taslimitehrani, A. Jadhav, and J. Pathak, ‘‘Empower-
ing personalized medicine with big data and semantic Web technology:
Promises, challenges, and use cases,’’ in Proc. IEEE Int. Conf. Big Data
(Big Data), Oct. 2014, pp. 790–795.
[64] A. Sebaa, F. Chikh, A. Nouicer, and A. Tari, ‘‘Medical big data ware-
house: Architecture and system design, a case study: Improving healthcare
resources distribution,’J. Med. Syst., vol. 42, no. 4, p. 59, Apr. 2018, doi:
10.1007/s10916-018-0894-9.
[65] J. F. Sánchez-Rada, A. Pascual, E. Conde, and C. A. Iglesias, ‘‘A big
linked data toolkit for social media analysis and visualization based
on W3C Web components,’’ in On Move to Meaningful Internet Sys-
tems. Valletta, Malta: Springer, 2018, pp. 498–515. [Online]. Available:
https://link.springer.com/chapter/10.1007/978-3-030-02671-4_30
[66] A. Smirnov, A. Ponomarev, N. Teslya, and N. Shilov, ‘‘Human-computer
cloud for smart cities: Tourist itinerary planning case study,’’ in Business
Information Systems Workshops (Lecture Notes in Business Information
Processing), vol. 303, W. Abramowicz, Ed. Cham, Switzerland: Springer,
2017, pp. 179–190.
[67] A. Aguilera, R. Grunzke, D. Habich, J. Luong, D. Schollbach,
U. Markwardt, and J. Garcke, ‘‘Advancing a gateway infrastructure for
wind turbine data analysis,’J. Grid Comput., vol. 14, no. 4, pp. 499–514,
Dec. 2016, doi: 10.1007/s10723-016-9376-9.
[68] R. S. Santos, T. A. Vaz, R. P. Santos, and J. M. P. de Oliveira, ‘‘Big
data analytics in a public general hospital,’’ in Machine Learning, Opti-
mization, and Big Data (Lecture Notes in Computer Science), vol. 10122,
P. M. Pardalos, P. Conca, G. Giuffrida, and G. Nicosia, Eds. Cham,
Switzerland: Springer, 2016, pp. 433–441.
[69] H. Khazaei, S. Zareian, R. Veleda, and M. Litoiu, ‘‘Sipresk: A
big data analytic platform for smart transportation,’’ in Proc.
1st EAI International Summit, Smart City 360. Bratislava,
Slovakia: Springer, 2016, pp. 419–430. [Online]. Available:
https://link.springer.com/chapter/10.1007/978-3-319-33681-7_35
[70] A. Fiannaca, L. La Paglia, M. La Rosa, A. Messina, P. Storniolo, and
A. Urso, ‘‘Integrated DB for bioinformatics: A case study on analysis
of functional effect of MiRNA SNPs in cancer,’’ in Proc. Int. Conf. Inf.
Technol. Bio Med. Inform., Porto, Portugal, Sep. 2016, pp. 214–222.
[71] A. Majumdar and I. Bose, ‘‘Detection of financial rumors using big data
analytics: The case of the bombay stock exchange,’J. Organizational
Comput. Electron. Commerce, vol. 28, no. 2, pp. 79–97, Apr. 2018, doi: 10.
1080/10919392.2018.1444337.
[72] G. Escobedo, N. Jacome, and G. Arroyo-Figueroa, ‘‘Big data & analytics
to support the renewable energy integration of smart grids—Case study:
Power solar generation,’’ in Proc. 2nd Int. Conf. Internet Things, Big Data
Secur. IoTBDS, Porto, Portugal, Apr. 2017, pp. 267–275.
[73] Y. Zhuang, Y. Wang, J. Shao, L. Chen, W. Lu, J. Sun, B. Wei, and J. Wu,
‘‘D-ocean: An unstructured data management system for data ocean envi-
ronment,’Frontiers Comput. Sci., vol. 10, no. 2, pp. 353–369, Apr. 2016,
doi: 10.1007/s11704-015-5045-6.
[74] C.-M. Chen, ‘‘Use cases and challenges in telecom big data analytics,’
APSIPA Trans. Signal Inf. Process., vol. 5, pp. 1–7, Dec. 2016, doi:
10.1017/ATSIP.2016.20.
[75] M. F. Huber, M. Voigt, and A.-C. N. Ngomo, ‘‘Big data architec-
ture for the semantic analysis of complex events in manufacturing,’’
in Informatik. Bonn, Germany: Gesellschaft für Informatik e.V., 2016,
pp. 353–360. [Online]. Available: https://dl.gi.de/handle/20.500.12116/
1139;jsessionid=D794018779FF36E5A6CBE13273EE9C67
[76] Q. Huang, G. Cervone, D. Jing, and C. Chang, ‘‘DisasterMapper,’’ in
Proc. 4th Int. ACM SIGSPATIAL Workshop Anal. Big Geospatial Data
BigSpatial, 2015, pp. 1–6.
[77] M. Xu, S. Siraj, and L. Qi, ‘‘A Hadoop-based data processing platform for
fresh Agro products traceability,’’ in Proc. Eur. Conf. Data Mining, 2015,
pp. 37–44. [Online]. Available: http://www.iadisportal.org/components/
com_booklibrary/ebooks/201508L005.pdf
[78] M. Steinbach, G. Karypis, and V. Kumar, ‘‘A comparison of document
clustering techniques,’’ in Proc. TextMining Workshop KDD, May 2000,
pp. 1–20.
[79] Y. Zhao, G. Karypis, and U. Fayyad, ‘‘Hierarchical clustering algorithms
for document datasets,’Data Mining Knowl. Discovery, vol. 10, no. 2,
pp. 141–168, Mar. 2005, doi: 10.1007/s10618-005-0361-3.
[80] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction
to Cluster Analysis, 99th ed. Hoboken, NJ, USA: Wiley, 2009. [Online].
Available: http://gbv.eblib.com/patron/FullRecord.aspx?p=469065
[81] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms,
2nd ed. Piscataway, NJ, USA: IEEE Press, 2011.
[82] J. Cleve and U. Lämmel, Data Mining. Munich, Germany: De Gruyter Oldenbourg.
[Online]. Available: https://doi.org/10.1515/9783110456776
[83] P.-N. Tan, M. Steinbach, A. Karpatne, and V. Kumar, Introduction to Data
Mining. London, U.K.: Pearson, 2019.
[84] MATLAB. Agglomerative Hierarchical Cluster Tree—MATLAB
Linkage—MathWorks. Accessed: Mar. 27, 2020. [Online]. Available:
https://mathworks.com/help/stats/linkage.html?s_tid=mwa_osa_
a#d117e514451
[85] A. L. Marra, F. Martinelli, P. Mori, and A. Saracino, ‘‘Implementing usage
control in Internet of Things: A smart home use case,’’ in Proc. IEEE
Trustcom/BigDataSE/ICESS, Aug. 2017, pp. 1056–1063.
[86] G. Alfian, M. F. Ijaz, M. Syafrudin, M. A. Syaekhoni, N. L. Fitriyani, and
J. Rhee, ‘‘Customer behavior analysis using real-time data processing,’’
Asia Pacific J. Marketing Logistics, vol. 31, no. 1, pp. 265–290, Jan. 2019,
doi: 10.1108/APJML-03-2018-0088.
[87] K. Vassakis, E. Petrakis, I. Kopanakis, J. Makridis, and G. Mastorakis,
‘‘Location-based social network data for tourism destinations,’’ in Big
Data and Innovation in Tourism, Travel, and Hospitality: Managerial
Approaches, Techniques, and Applications, M. Sigala, R. Rahimi, and
M. Thelwall, Eds. Singapore: Springer, 2019, pp. 105–114.
[88] S. Muthuramalingam, A. Bharathi, S. Rakesh kumar, N. Gayathri,
R. Sathiyaraj, and B. Balamurugan, ‘‘IoT based intelligent transporta-
tion system (IoT-ITS) for global perspective: A case study,’’ in Internet
of Things and Big Data Analytics for Smart Generation, V. E. Balas,
V. K. Solanki, R. Kumar, and M. Khari, Eds. Cham, Switzerland: Springer,
2019, pp. 279–300.
[89] J. vom Brocke and A. Maedche, ‘‘The DSR grid: Six core dimensions for
effectively planning and communicating design science research projects,’’
Electron. Markets, vol. 29, no. 3, pp. 379–385, Sep. 2019, doi: 10.1007/
s12525-019-00358-7.
[90] ‘‘On finding the natural number of topics with latent Dirichlet allocation: Some observations,’’ in Advances in Knowledge Discovery and Data Mining, M. J. Zaki et al., Eds. Berlin, Germany: Springer, 2010.
MATTHIAS VOLK (Graduate Student Member, IEEE) studied business
informatics at the Faculty of Computer Science, Otto von Guericke Univer-
sity Magdeburg (OVGU). He received the master’s degree in 2016. He is
currently pursuing the Ph.D. degree. Since then, he has been employed
as a Scientific Researcher. During his studies, he gained lots of practical
experience as a software developer in different companies such as Volk-
swagen. During his scientific career, he participated in many international
scientific congresses and projects, not only as a speaker but also as a reviewer
or a session chair. His research interests include the domain of data-intensive systems, related projects and technologies, and their management.
DANIEL STAEGEMANN studied computer science at Technical University
Berlin (TUB). He received the master’s degree in 2017. He is currently
pursuing the Ph.D. degree with the Otto von Guericke University Magdeburg.
Since 2018, he has been employed as a Scientific Researcher with OVGU.
His research interests include big data, especially its testing.
IVAYLA TRIFONOVA studied business informatics at the Otto von Guericke
University Magdeburg. She received the master’s degree in 2019. She is
currently working as an IT Consultant in the area of Life Science at a large
European consulting and IT services company.
SASCHA BOSSE studied computer science at the Faculty of Computer
Science, Otto von Guericke University Magdeburg. He received the mas-
ter’s degree in 2011, and the academic degree Doktoringenieur in 2016.
Since 2012, he has been working as a Researcher with the VLBA Lab.
Since 2020, he has also been working as a Subject Specialist for computer
science and mathematics with University Library Magdeburg, where he is
also responsible for the business applications. His research interests include
IT service management, modeling, simulation, and optimization.
KLAUS TUROWSKI studied business and engineering at the University of
Karlsruhe. He received the Ph.D. degree from the Institute for Business
Informatics, University of Münster, and the habilitated degree in business
informatics from the Faculty of Computer Science, Otto von Guericke Uni-
versity Magdeburg. In 2000, he deputized the Chair of business informatics
at the University of the Federal Armed Forces München. Since 2001, he has
been heading the Chair of business informatics and systems engineering with
the University of Augsburg. Since 2011, he has also been heading the Chair
of business informatics (AG WI) with the Otto von Guericke University
Magdeburg, the Very Large Business Applications Lab (VLBA Lab), and the
world’s largest SAP University Competence Center (SAP UCC Magdeburg).
Additionally, he worked as a guest lecturer at several universities around the
world. He was a Lecturer with the Universities of Darmstadt and Konstanz.
He was a (co-)organizer of a multiplicity of national and international scientific congresses and workshops (>30) and acted as a member of several programme committees (>130) and expert groups. In the context of his university activities, as well as his work as an independent consultant, he gained practical experience in industry.
VOLUME 8, 2020 186619
... Further, with regard to the needs of organizations that are using the analysis of data as a valuable tool to enhance their operations [2], not only the amount of data is relevant. In many cases it is equally important to utilize different types of media from varying sources as input [4]. Hence, from a technical perspective, not only the quantity constitutes a challenge but also aspects like the structure and diversity play an important role. ...
... Since real-world data is oftentimes flawed [29], with information being wrong, missing, duplicated, or inconsistently formatted, this issue needs to be addressed before continuing to use them. Further, it is in many cases desirable to merge data from different sources [4], which adds to the heterogeneity. Moreover, even if the data is generally of high quality, it might be required to change its structure to facilitate further processing steps. ...
... Overall, the testing and benchmarking of more complex IT systems in general but especially BD applications in particular is highly reliant on the availability of suitable data sets that can be used [34,35]. However, due to disparate application scenarios and requirements, not every aspect is always equally important [4,36]. This also leads to varying needs that have to be considered when choosing the utilized approach(es) for providing the necessary test data. ...
Conference Paper
With society’s increasing data production and the corresponding demand for systems that are capable of utilizing them, the big data domain has gained significant importance. However, besides the systems’ actual implementation, their testing also needs to be considered. For this, oftentimes, proper test data sets are necessary. This publication discusses several different approaches how these can be provisioned and, further, highlights the respective advantages, disadvantages, and suitable application scenarios. In doing so, researchers and practitioners that are implementing big data applications and need to test them, or who are generally interested in the domain, are supported in their own considerations on how to obtain test data.
... For an up-to-date version, the missing part between 2019-2021 is covered. To do so, a structured literature review, following the same steps from (Volk et al. 2020b), is conducted. The design and development (III) will occur in the same-named third section. ...
... Through the use of a complex procedure that is comprised of a literature review (1), use case analysis (2), and agglomerative clustering approach (3), within the already introduced contribution provided by Volk et al. (2020b), a total of nine distinct SUC were formed out of 39 specific big data use cases. ...
... Potential users of the formed SUC may receive general information for the setup of their related projects and detailed knowledge when checking the aligned used cases in detail (Volk et al. 2020b). Apart from specific technologies, tailored architectural concepts are listed in each of those related use cases. ...
Conference Paper
For almost a decade now, big data has become the foundation of today's data-intensive systems used for various disciplines, such as data science or artificial intelligence. Although a certain level of maturity has been reached since then, not only in the domain itself but also in the engineering of interconnected systems, many problems still exist today. The number of available technologies and architectural concepts, whose application is often very use case-specific, makes the successful implementation of big data projects still a non-trivial undertaking. To overcome this problem and deliver support with the realization of a related project, existing standard use cases in this domain are analyzed, and architectural concepts are derived through the design science research methodology. By observing essential criteria, like use case descriptions as well as relevant requirements, decision-makers can harness architectural concepts and technology recommendations for their setup.
... Even though the presented prototypical implementation uses historical data that are available in batch, stream processing will be simulated by transmitting the data in a somewhat continuous way over time, instead of providing it all at once. Thus, even though the dataset is rather small in volume in the context of BD, the developed application still showcases a highly typical BD use case (Volk et al. 2020). Moreover, the developed architecture is designed in a way to allow for future scaling, leaning further into the demands associated with BD. ...
Book
This book contains the proceedings of the 21st International Conference on Smart Business Technologies (ICSBT 2024). This year, ICSBT is held in collaboration with the ESEO, which hosts this event in Dijon, France, on July 9-11, 2024. It was sponsored by the Institute for Systems and Technologies of Information, Control and Communication (INSTICC). ICSBT 2024 was also organized in cooperation with the ACM Special Interest Group on Management Information Systems. The International Conference on Smart Business Technologies (formerly known as ICE-B - International Conference on e-Business), aims at bringing together researchers and practitioners who work on e-Business technology and its applications. The scope of the conference covers low-level technological issues, such as technology platforms, internet of things, artificial intelligence, data science and web services, but also higher-level issues, such as business processes, business intelligence, digital twins, value setting and business strategy. Furthermore, it covers different research approaches (like qualitative cases, experiments, forecasts, and simulations) to address these issues and different possible application domains (like manufacturing, service management and trade systems) with their own specific needs and requirements. We invite both more academic and practical oriented submissions, but we are especially interested in academic research with a potential practical impact and practical research papers with theoretical implications. ICSBT 2024 received 27 paper submissions from 13 countries of which 14.8% were accepted and published as full papers. A double-blind paper review was performed for each submission by at least 2 but usually 3 or more members of the International Program Committee, which is composed of established researchers and domain experts. The high quality of the ICSBT 2024 program is enhanced by the keynote lecture delivered by distinguished speakers who are renowned experts in their fields: Samuel Fosso Wamba (Toulouse Business School, France) and Sukhpal Singh Gill (Queen Many University of London, United Kingdom). All presented papers will be available at the SCITEPRESS Digital Library and will be submitted for evaluation for indexing by SCOPUS, Google Scholar, The DBLP Computer Science Bibliography, Semantic Scholar, Engineering Index and Web of Science / Conference Proceedings Citation Index. As recognition for the best contributions, several awards based on the combined marks of paper reviewing, as assessed by the Program Committee, and the quality of the presentation, as assessed by session chairs at the conference venue, are conferred at the closing session of the conference. Authors of selected papers will be invited to submit extended versions for inclusion in a forthcoming book of ICSBT Selected Papers to be published by Springer, as part of the CCIS Series. Some papers will also be selected for publication of extended and revised versions in the special issue of the Socio-Economic Planning Sciences and IMA Journal of Management Mathematics. The program for this conference required the dedicated effort of many people. Firstly, we must thank the authors, whose research efforts are herewith recorded. Next, we thank the members of the Program Committee and the auxiliary reviewers for their diligent and professional reviewing. We would also like to deeply thank the invited speakers for their invaluable contribution and for taking the time to prepare their talks. 
Finally, a word of appreciation for the hard work of the INSTICC team; organizing a conference of this level is a task that can only be achieved by the collaborative effort of a dedicated and highly competent team. We wish you all an exciting and inspiring conference. We hope to have contributed to the development of our research community, and we look forward to having additional research results presented at the next edition of ICSBT, details of which are available at https://icsbt.scitevents.org.
... Many firms view the implementation of AB as crucial and believe it has great potential (Staegemann et al., 2021). However, because of the high volume, velocity, and variety of the information assets involved, extracting valuable knowledge and information from them remains highly complex (Volk et al., 2020). Nevertheless, AB has lately remained reasonably low (Nam et al., 2019). ...
Article
Full-text available
This study examined factors impacting the big data adoption of small and medium enterprises (SMEs) in Vietnam. A mixed-methods design was used: qualitative research was conducted through a group discussion with 15 participants, followed by a cross-sectional survey of 372 SME representatives. The results show that perceived benefit, simplicity, compatibility, data quality, security and privacy, vendor support, management support, financial investment, perceived usefulness, and attitudes all influence adoption. This research extends the academic framework and examines causal relationships by adopting new characteristics from the integrated perspective of TOE combined with TAM, going beyond existing research models.
... Big data represent the interactions between employees and consumers recorded in an organisation's system, providing actionable, predictive, descriptive, and prescriptive results [6]. Owing to the large volume and high velocity of big data and the diversity of the underlying information assets, extracting meaningful information and knowledge remains challenging [7,8]. ...
Article
Full-text available
The adoption of big data analytics (BDA) is gaining pace in both practice and theory, owing to its prospects and potential advantages. Numerous researchers believe that BDA could provide significant advantages, despite constant battles with the constraints that limit its implementation. Here, we suggest an integrated model to investigate the drivers and impacts of BDA adoption in the Jordanian hotel industry based on the technology-organisation-environment framework and the resource-based view theory. The suggested model incorporates both the adoption and performance components of BDA into a single model. For data collection, in this study, we used an online questionnaire survey. The research model was verified based on responses from 119 Jordanian hotels. This study yielded two significant findings. First, we discovered that relative advantage, organisational readiness, top management support, and government regulations have a major impact on BDA adoption. The study results also reveal a strong and favourable association between BDA adoption and firm performance. Finally, information sharing was found to have a moderating effect on the association between BDA adoption and firm performance. The data revealed how businesses might increase their BDA adoption for improved firm performance. The present study adds to the limited but growing body of literature investigating the drivers and consequences of technology acceptance. The findings of this study can serve as a resource for scholars and practitioners interested in big data adoption in emerging nations.
... Big data (BD) essentially represent the interactions among employees and clients which are stored in the organisation's system, based on which actionable, predictive, descriptive and prescriptive outcomes can be obtained (Baig et al., 2021; Shirdastian et al., 2019). However, the high volume, high velocity, and diverse information assets of BD make it a challenge to extract valuable knowledge and information from them (Volk et al., 2020). ...
Article
Full-text available
Big data analytics (BDA) adoption has gained attention in both practical and theoretical circles owing to the opportunities and advantages that can be reaped from it. In theory, the majority of researchers have evidenced the benefits of BDA, although barriers to its adoption have also been mentioned. This study draws upon the technology-organisation-environment framework and resource-based view theory to propose an integrated model that examines the drivers and impact of BDA adoption in the retail industry in Jordan. The proposed single model encapsulates the aspects of BDA adoption and performance. The study makes use of an online questionnaire survey to collect the required data, and the research model is eventually validated based on 132 responses gathered from the retail industry in Jordan. The findings highlight two major observations. The first is that relative advantage, organisational readiness, top management support, government support, data variety and data velocity all have a significant influence over BDA adoption. The second observation is that a significant association exists between BDA adoption and firm performance, providing information on how firms can increase their BDA adoption for better performance. This study contributes to the literature dedicated to examining BDA in terms of its drivers and impact on performance and can be used as a reference in developing nations.
Chapter
The extensive use of information and, thereby, also the application of big data (BD) technologies are some of the biggest influencing factors in today’s society. However, due to the sheer deluge of data, it is not feasible to turn them into usable information in a manual fashion. Instead, automated approaches are required, which makes machine learning (ML) algorithms an important part of the corresponding technical ecosystem. Yet, besides the pure provisioning of the algorithms, it is also necessary to make sure the delivered quality is sufficient. Hence, the testing of ML algorithms in the BD context, with its specific challenges, is highly important. For this reason, in the publication at hand, based on previously identified BD standard use cases, the common ML applications are identified and it is discussed how they can be tested, providing future researchers and practitioners in the domain with valuable insights on how to create better quality BD applications.
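The chapter's concrete mapping of use cases to tests is not reproduced here, but one technique commonly discussed for testing ML components that lack an exact oracle is metamorphic testing. The following is a minimal sketch under the assumption of a k-nearest-neighbors classifier, synthetic data, and a translation-invariance relation (using the scikit-learn API); it illustrates the general idea, not the chapter's actual test design:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def test_knn_translation_invariance():
        # Metamorphic relation: translating all points by the same
        # constant vector preserves Euclidean distances, so k-NN
        # predictions must not change between the two runs.
        rng = np.random.default_rng(42)
        X_train = rng.normal(size=(100, 4))
        y_train = rng.integers(0, 2, size=100)
        X_test = rng.normal(size=(20, 4))
        shift = np.full(4, 3.7)  # arbitrary constant translation

        baseline = KNeighborsClassifier(n_neighbors=5).fit(
            X_train, y_train).predict(X_test)
        follow_up = KNeighborsClassifier(n_neighbors=5).fit(
            X_train + shift, y_train).predict(X_test + shift)

        assert (baseline == follow_up).all()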
Conference Paper
Today’s society is heavily driven by data intensive systems, whose application promises immense benefits. However, this only applies when they are utilized correctly. Yet, these types of applications are highly susceptible to errors. Consequently, it is necessary to test them comprehensively and rigorously. One method that puts an especially high focus on test coverage is the test driven development (TDD) approach. While TDD in general has a rather long history, its application in the context of data intensive systems is still somewhat novel. However, a microservice-based test driven development concept has recently been proposed for the big data domain. The publication at hand explores its feasibility regarding the application in an actual project. For this purpose, a prototypical, microservice-based information retrieval system is implemented in a test driven way, with particular consideration for scalability.
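The test driven cycle applied in such a project can be hinted at with a small sketch, assuming a hypothetical inverted-index component of an information retrieval service (all names are illustrative, not taken from the paper): the test is written first and fails, then the smallest implementation that makes it pass is added.

    from collections import defaultdict

    # Step 1 (red): the test is written before any implementation exists.
    def test_index_and_query_returns_matching_document_ids():
        index = InvertedIndex()
        index.add(doc_id=1, text="big data systems scale out")
        index.add(doc_id=2, text="test driven development")
        assert index.query("data") == {1}
        assert index.query("development") == {2}

    # Step 2 (green): the minimal implementation that makes it pass.
    class InvertedIndex:
        def __init__(self):
            self._postings = defaultdict(set)  # term -> set of doc ids

        def add(self, doc_id, text):
            for term in text.lower().split():
                self._postings[term].add(doc_id)

        def query(self, term):
            return self._postings.get(term.lower(), set())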
Article
Full-text available
In emerging economies, Big Data (BD) analytics has become increasingly popular, particularly regarding the opportunities and expected benefits. Such analyses have identified that the production and consumption of goods and services, while unavoidable, have proven to be unsustainable and inefficient. For this reason, the concept of the circular economy (CE) has emerged strongly as a sustainable approach that contributes to the eco-efficient use of resources. However, to develop a circular economy in BD environments, it is necessary to understand what factors influence the intention to accept its implementation. The main objective of this research was to assess the influence of attitudes, subjective norms, and perceived behavioral norms on the intention to adopt CE in BD-mediated environments. The methodology is quantitative and cross-sectional, with a descriptive correlational approach, based on the theory of planned behavior and a Partial Least Squares Structural Equation Model (PLS-SEM). A total of 413 Colombian service SMEs participated in the study. The results show that managers' attitudes, subjective norms, and perceived behavioral norms positively influence the intentions of organizations to implement CE best practices. Furthermore, most organizations have positive intentions toward CE, and these intentions positively influence the adoption of BD; however, the lack of government support and cultural barriers are perceived as the main limitations to its adoption. The research leads to the conclusion that BD helps business and government develop strategies to move toward CE, and that there is a clear positive will and intent toward a more restorative and sustainable corporate strategy.
Conference Paper
The concept of big data hugely impacts today’s society and promises immense benefits when utilized correctly, yet the corresponding applications are highly susceptible to errors. Therefore, testing should be performed as extensively and rigorously as possible. One of the solutions proposed in the literature is the test driven development (TDD) approach. TDD is a software development approach with a long history but has not been widely applied in the big data domain. Nevertheless, a microservice-based test driven development concept has been proposed in the literature, and the feasibility of applying it in actual projects is explored here. For that, the fraud detection domain has been selected and a proof-of-concept online fraud detection platform is implemented, which processes real-time streaming data and separates fraudulent from legitimate transactions. After the implementation, an evaluation was carried out regarding test coverage and code quality. The automatic code analysis reports revealed that TDD had produced very reliable, maintainable, and secure code at the first attempt that is ready for production. Finally, the evaluation revealed that it is highly feasible to develop big data applications using the concept mentioned. However, choosing suitable services, tools, frameworks, and code coverage solutions can make it more manageable.
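The test-first style used for such a platform can be sketched with a hypothetical single-transaction rule and record layout (the paper's actual detection logic and services are not reproduced here):

    # Step 1 (red): tests written before the implementation existed.
    def test_transactions_above_limit_are_flagged():
        txn = {"amount": 15000.0, "country": "DE", "allowed_countries": {"DE"}}
        assert is_fraudulent(txn)

    def test_transactions_within_limit_and_region_pass():
        txn = {"amount": 120.0, "country": "DE", "allowed_countries": {"DE"}}
        assert not is_fraudulent(txn)

    # Step 2 (green): the minimal rule that makes both tests pass;
    # a real platform would combine many such rules or a trained model.
    def is_fraudulent(txn, limit=10000.0):
        return (txn["amount"] > limit
                or txn["country"] not in txn["allowed_countries"])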
Article
Full-text available
During the last few decades, many organizations have started recognizing the benefits of Big Data (BD) in driving their digital transformation and gaining faster insights from faster data. Making smart data-driven decisions helps organizations ride the waves toward invaluable investments. The successful implementation of Big Data projects depends on their alignment with the current organizational, technological, and analytical aspects. Identifying the Critical Success Factors (CSFs) for Big Data is fundamental to overcoming the challenges surrounding Big Data Analytics (BDA) and implementation. In recent years, the investigations related to identifying the CSFs of Big Data and Big Data Analytics have expanded on a large scale, trying to address the limitations in existing publications and contribute to the body of knowledge. This paper aims to provide more understanding of the existing CSFs for Big Data Analytics and implementation and contributes to the body of knowledge by answering three research questions: 1) How many studies have investigated Big Data CSFs for analytics and implementation? 2) What are the existing CSFs for Big Data Analytics? and 3) What are the categories of Big Data Analytics CSFs? By conducting a Systematic Literature Review (SLR) of the available studies related to Big Data CSFs in the last twelve years (2007-2019), a final list of sixteen (16) related articles was extracted and analyzed to identify the Big Data Analytics CSFs and their categories. Based on the descriptive qualitative content analysis method for the selected literature, this SLR paper identifies 74 CSFs for Big Data and proposes a classification schema and framework in terms of 5 categories, namely Organization, Technology, People, Data Management, and Governance. The findings of this paper could be used as a referential framework for a successful strategy and implementation of Big Data by formulating more effective data-driven decisions. Future work will investigate the priority of the Big Data CSFs and their categories toward developing a conceptual framework for assessing the success of Big Data projects.
Conference Paper
Full-text available
In order to ensure adequate education and training in a statistics-driven field, large sets of content-compliant training data (CCTD) are required. Within the context of practical orientation, such data sets should be as realistic as possible concerning the content in order to improve the learning experience. While there are different data generators for special use cases, the approaches mostly aim at evaluating the performance of database systems. Therefore, they focus on the structure but not on the content. Based on formulated requirements, this paper designs a possible approach for generating CCTD in the context of Big Data education. For this purpose, different Machine Learning algorithms could be utilized. In future work, specific models will be designed, implemented and evaluated.
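The general idea can be approximated with a far simpler sketch than the ML models envisioned in the paper: learn per-field distributions from a small seed dataset and sample new, content-like rows from them. The two-field record layout below is a hypothetical example:

    import random
    import statistics

    def fit_generator(seed_rows):
        # Learn simple per-field distributions from the seed data:
        # categorical values are resampled, numeric values are drawn
        # from a fitted normal distribution.
        names = [r["name"] for r in seed_rows]
        amounts = [r["amount"] for r in seed_rows]
        mu, sigma = statistics.mean(amounts), statistics.stdev(amounts)

        def sample(n):
            return [{"name": random.choice(names),
                     "amount": round(random.gauss(mu, sigma), 2)}
                    for _ in range(n)]
        return sample

    # Hypothetical usage:
    # sample = fit_generator(seed_rows)
    # training_set = sample(100000)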
Chapter
Full-text available
Big data has proved to be one of the most promising trends in recent years. However, many challenges and barriers still exist, especially when it comes to the strategic planning and realization of those kinds of projects. Most of all, the selection and combination of the domain-related technologies represents a sophisticated endeavor that increases the complexity of creating a big data system. Hence, it is not surprising that the demand for experts in this area is steadily increasing. To overcome this problem and the related shortage of required knowledge, the following paper introduces the concept of a decision support system for the selection of appropriate big data technologies to implement a given project. Through the use of the design science research methodology, a preliminary artifact was developed that provides sophisticated recommendations as well as architectural models and blank systems to support the systems engineering procedure.
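The recommendation idea behind such a decision support system can be hinted at with a toy rule-based sketch; the rules and technology names below are illustrative assumptions, not the knowledge base of the described artifact:

    def recommend(requirements):
        # Map coarse project characteristics to candidate technology
        # classes; a real system would draw on a curated knowledge base.
        stack = []
        if requirements.get("processing") == "stream":
            stack.append("stream processor (e.g., Apache Flink)")
        else:
            stack.append("batch engine (e.g., Apache Spark)")
        if requirements.get("storage") == "unstructured":
            stack.append("distributed file store (e.g., HDFS)")
        else:
            stack.append("wide-column store (e.g., Apache Cassandra)")
        return stack

    # recommend({"processing": "stream", "storage": "unstructured"})
    # -> ['stream processor (e.g., Apache Flink)',
    #     'distributed file store (e.g., HDFS)']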
Conference Paper
Full-text available
Today, the amount and complexity of data that is globally produced increases continuously, surpassing the abilities of traditional approaches. Therefore, to capture and analyze those data, new concepts and techniques are utilized to engineer powerful big data systems. However, despite the existence of sophisticated approaches for the engineering of those systems, their testing is not sufficiently researched. Hence, in this contribution, a comparison between traditional software testing, as a common procedure, and the requirements of big data testing is drawn. The determined specificities of the big data domain are mapped to their implications for the implementation and the consequent challenges. Furthermore, those findings are transferred into six guidelines for the testing of big data systems. In the end, limitations and future prospects are highlighted.
Conference Paper
Full-text available
The amount of data being produced and analyzed is increasing year by year. As a result, the concept of big data has gained interest among researchers and practitioners. However, a plethora of challenges and potentials require attention from researchers and practitioners to enhance future development. Apart from the pure processing of the data and the obstacles this entails, other dimensions also need to be considered in this context. These include the technical planning of the related systems as well as the human interaction with them. When it comes to the strategic design, development, deployment and use of big data systems, especially the aspect of potential issues is often underestimated and under-researched. Hence, in this contribution, a comprehensive investigation of the various dimensions of big data from a quality assurance perspective is performed. Consequently, an overview of the current state of the art and promising solutions is presented, providing a foundation for the future work of practitioners and researchers.
Conference Paper
Full-text available
This contribution examines the terms big data and big data engineering, considering their specific characteristics and challenges. From these, it deduces the need for new ways to support the creation of corresponding systems in order to help big data reach its full potential. Subsequently, the state of the art is analysed and subdomains in the engineering of big data solutions are presented. In the end, a possible concept for filling the identified gap is proposed and future perspectives are highlighted.
Chapter
Big data jobs will increase in importance over the next years. However, at the international level, the labor market for these professionals is characterized by a critical skill shortage. What are the big data specialist profiles that are most sought after in the market? What are their main differences in terms of tasks and skill requirements? This chapter provides a snapshot of the most in-demand big data jobs, contributing to clarifying their boundaries. It also delves into the main characteristics of the specific professional profiles that have received increasing attention in recent years, namely data scientists and data/business analysts. The review of the contributions provided by experts and scholars operating in the data science and analytics domain clarifies the main differences between these roles on the technical side. However, despite the increasing importance of soft skills, the behavioral competency profile of big data jobs is still ill-defined.