Preprint copy
Presented at 4th International Conference on Integrated Information.
Accepted for publication in AIP Proceedings, exp. end 2014.
What is Big Data? A Consensual Definition and a Review of
Key Research Topics
Andrea De Mauro1, a), Marco Greco2, b) and Michele Grimaldi2, c)
1Department of Enterprise Engineering, University of Rome Tor Vergata, Via del Politecnico 1, 00133 Roma, Italy
2Department of Civil and Mechanical Engineering, University of Cassino and Southern Lazio, Via Di Biasio 43,
03043 Cassino (FR), Italy
a)Corresponding author: andrea.de.mauro@uniroma2.it
b)m.greco@unicas.it
c)m.grimaldi@unicas.it
Abstract. Although Big Data is a trending buzzword in both academia and the industry, its meaning is still shrouded by
much conceptual vagueness. The term is used to describe a wide range of concepts: from the technological ability to
store, aggregate, and process data, to the cultural shift that is pervasively invading business and society, both drowning in
information overload. The lack of a formal definition has led research to evolve into multiple and inconsistent paths.
Furthermore, the existing ambiguity among researchers and practitioners undermines an efficient development of the
subject. In this paper we have reviewed the existing literature on Big Data and analyzed its previous definitions in order
to pursue two results: first, to provide a summary of the key research areas related to the phenomenon, identifying
emerging trends and suggesting opportunities for future development; second, to provide a consensual definition for Big
Data, by synthesizing common themes of existing works and patterns in previous definitions.
Keywords: Big Data; Analytics; Information Management; Data Processing; Business Intelligence.
INTRODUCTION
Big Data¹ has now become a ubiquitous term in many parts of industry and academia. As often happens in these
cases, the frequent utilization of the same words in different contexts poses a threat towards the structured evolution
of its meaning. For this reason it is necessary to invest time and effort in the proposition and the acceptance of a
standard definition of Big Data that would pave the way to its systemic evolution and minimize the confusion
related to its usage. In order to describe Big Data we have decided to start from an “as is” analysis of the contexts in
which the term most frequently appears. Given its remarkable success and its hectic evolution, Big Data possesses
multiple and diverse nuances of meaning, all of which have the right to exist. By analyzing the most significant
occurrences of this term in both academic and business literature we have identified four key themes to which Big
Data refers: Information, Technologies, Methods and Impact. We can reasonably assert that the vast majority of
references to Big Data encompass one of the four themes listed above. Understanding how these themes have been
dealt with in existing literature and how they are mutually interconnected is the objective of the first section of this
paper and is propaedeutic to the attempt of proposing a thorough definition, which is what the second section aims
to provide. We believe that having such a definition will enable a more conscious usage of the term Big Data and a
more coherent development of research on this subject.
¹ We have chosen to capitalize the term ‘Big Data’ throughout this article to clarify that it is the specific subject we are discussing.
REVIEW OF MAIN RESEARCH TOPICS
This section offers a broad, though non-exhaustive, review of research topics in the area of Big Data. We have examined a large number of abstracts of peer-reviewed conference and journal papers and identified recurring topics by looking at the appearance frequency of top keywords and making an educated guess about their interrelations. This heuristic approach was needed to depict the ample range of concepts related to Big Data while using a relatively small number of topic categories. A systematic literature
review is beyond the scope of this paper and left as an opportunity for future work. The input list of documents was
obtained from Elsevier’s Scopus, a citation database containing more than 50 million records from around 5,000
publishers. On the 3rd of May 2014 we exported a list of 1,581 conference papers and articles that contained the full term “Big Data” in either the title or the author-provided keywords². We have removed those entries where the abstract text was not available; this left us with a corpus of 1,437 documents. By counting the appearance
frequency of words included in the abstracts we have identified the most recurring items. Figure 1 shows a static tag
cloud visualization (also known as “word cloud”) of the most popular words in the abstracts we analyzed, obtained
through the online tool ManyEyes (Viegas et al. 2007).
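For readers wishing to reproduce this kind of keyword frequency analysis, the sketch below shows one possible approach in Python. It is a minimal illustration only: the file name scopus_export.csv, the abstract column name and the stop-word list are assumptions made for the example, not artifacts of the original study.

    import csv
    import re
    from collections import Counter

    # Illustrative stop words; a real analysis would use a fuller list.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "for",
                  "is", "are", "on", "with", "this", "that", "we", "by"}

    def top_keywords(path, column="abstract", n=25):
        """Count the most frequent words across all abstracts in a CSV export."""
        counts = Counter()
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                words = re.findall(r"[a-z]+", row[column].lower())
                counts.update(w for w in words if w not in STOP_WORDS)
        return counts.most_common(n)

    if __name__ == "__main__":
        for word, freq in top_keywords("scopus_export.csv"):
            print(word, freq)

The resulting frequency list is exactly the kind of input that a tag cloud tool such as ManyEyes renders, with font size proportional to word frequency.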
By analyzing the most frequent keywords included in Big Data-related abstracts and considering their mutual
relationships we have identified four top research themes in current literature, namely: 1. Information, 2.
Technology, 3. Methods, 4. Impact. We believe that the great majority of papers written on Big Data touch upon one
or more of these four topics. For each of them we will now describe content, trends and enlist a number of relevant
works.
FIGURE 1. Static tag cloud visualization (word cloud) of key terms appearing in abstracts of Big Data-related papers.
The Fuel of Big Data: Information
One of the fundamental reasons for the Big Data phenomenon to exist is the current extent to which information can
be generated and made available. Digitization, i.e. the process of converting continuous, analog information into
discrete, digital and machine-readable format, reached broad popularity with the first “mass digitization” projects.
Mass digitization is the attempt to convert entire printed book libraries into digital collections by leveraging optical
character recognition (OCR) software in order to minimize human intervention (Coyle 2006). One of the most
popular attempts at mass digitization was the Google Print Library Project³, started in 2004, which aimed at digitizing more than 15 million volumes held in multiple university libraries, including Harvard, Stanford and Oxford. More
² We have used the following search query in Scopus: AUTHKEY("Big data") OR TITLE("big data") AND (LIMIT-TO(DOCTYPE, "cp") OR LIMIT-TO(DOCTYPE, "ar") OR LIMIT-TO(DOCTYPE, "ip")).
³ For more information you can visit the Google Books History page, available at http://www.google.com/googlebooks/about/history.html.
recently, a subtle differentiation has been proposed between digitization and its next step, datafication, i.e. putting a phenomenon in a quantified format so that it can be tabulated and analyzed (Mayer-Schönberger & Cukier 2013). The fundamental difference is that digitization enables analog information to be transferred and stored in a more convenient digital format, while datafication aims at organizing the digitized versions of analog signals in order to generate insights that could not have been inferred while the signals were in their original form. In the case of the previously cited Google mass digitization effort, the value of datafication emerged when researchers showed they were able to provide insights on lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology by using Google Books’ data (Michel et al. 2011).
Digitization and datafication have become pervasive phenomena thanks to the broad availability of devices that
are both connected and provided with digital sensors. Digital sensors enable digitization while connection lets data
be aggregated and, thus, permits datafication. Cisco estimated that between 2008 and 2009 the number of connected
devices overtook the number of living people (Evans 2011) and, according to Gartner (2014), by 2020 there will be 26 billion devices on earth, more than 3 devices on average per person. The pervasive presence of a variety of
objects (including mobile phones, sensors, Radio-Frequency Identification - RFID - tags, actuators), which are able
to interact with each other and cooperate with their neighbors to reach common goals, goes under the name of the
Internet of Things, IoT (Estrin et al. 2002; Atzori et al. 2010). This increasing availability of sensor-enabled,
connected devices is equipping companies with extensive information assets from which it is possible to create new
business models, improve business processes and reduce costs and risks (Chui et al. 2010). In other words, IoT is
one of the most promising fuels of Big Data expansion.
Another characteristic of the data generated today is its increasing variety in type. Structured data (traditional
text/numeric information) is now joined by unstructured data (audio, video, images, text and human language) and
semistructured data, such as XML and RSS feeds (Russom 2011). The diversity of data types is one of the
challenges that organizations need to tackle in order to make value out of the extensive informational assets
available today (Manyika et al. 2011).
Equipment for Working with Big Data: Technology
The term Big Data is frequently associated with the specific technology that enables its utilization. The sheer size of the datasets and the complexity of the operations needed to process them entail stringent memory, storage and computational performance requirements. According to Google Trends, the query most related to “Big Data” is “Hadoop”, which is indeed the most prominent technology associated with this topic. Hadoop is an open-source framework that enables the distributed processing of big quantities of data by using a group of dispersed machines and specific computer programming models. The main components of Hadoop are: 1. its file system, HDFS, which allows access to data scattered over multiple machines without having to cope with the complexity inherent in their dispersed nature; 2. MapReduce, a programming model designed to implement distributed and parallel algorithms in an efficient way. Both HDFS (Shvachko et al. 2010) and MapReduce (Dean & Ghemawat 2008) are evolutions of concepts originally proposed by Google (Ghemawat et al. 2003) that were subsequently developed as open-source projects within the Apache framework, which underscores Google’s central role in initiating current thinking about Big Data. The Hadoop framework contains multiple modules and libraries compatible with HDFS and MapReduce that extend its applicability to the various needs of coordination, analysis, performance management and workflow design that normally occur in Big Data applications.
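As a minimal illustration of the MapReduce model just described, the following Python sketch simulates the map, shuffle and reduce phases of the canonical word-count example on a single machine. A real Hadoop job would distribute these same phases across many nodes; the sketch only mirrors the logical structure, under assumptions chosen for brevity.

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # Map: emit a (key, value) pair for every word occurrence.
        return [(word, 1) for word in document.split()]

    def shuffle(pairs):
        # Shuffle: group values by key, as the framework does between phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce: aggregate all the values emitted for a given key.
        return key, sum(values)

    documents = ["big data needs big ideas", "data beats opinions"]
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    results = [reduce_phase(k, v) for k, v in shuffle(pairs).items()]
    print(sorted(results))

Because each map call and each reduce call is independent, the framework can run them in parallel on different machines, which is what makes the model suitable for Big Data workloads.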
The distributed nature of information requires a specific technological effort for transmitting big quantities of
data and for monitoring the overall system performance using special benchmarking techniques (Xiong et al. 2013).
Another fundamental technological element is the ability to store a greater quantity of data on smaller physical devices. Although Moore’s law (2006) suggests that storage capacity increases exponentially over time, a continuous and expensive research and development effort is still required to keep up with the pace at which data size grows (Hilbert & López 2011), especially given the growing share of byte-hungry data types such as images, sounds and videos.
Transforming Big Data into Value: Methods
The analysis of extensive quantities of data and the need to extract value from individual behaviors require processing methods that go beyond traditional statistical techniques. Knowledge of such methods, of their potential and, above all, of their limitations requires specific skills that are hard to find in today’s job market.
Both Manyika et al. (2011) and Chen et al. (2012) propose a list of Big Data analytical methods, which includes (in alphabetical order): A/B testing, Association rule learning, Classification, Cluster analysis, Data fusion and data integration, Ensemble learning, Genetic algorithms, Machine learning, Natural Language Processing, Neural networks, Network analysis, Pattern recognition, Predictive modelling, Regression, Sentiment Analysis, Signal Processing, Spatial analysis, Statistics, Supervised and Unsupervised learning, Simulation, Time series analysis and Visualization.
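To give a concrete flavor of the tooling behind two of the listed methods, cluster analysis and regression, the sketch below applies them to synthetic data with scikit-learn. This is an illustrative assumption about how such methods are typically invoked, not part of the original lists.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Cluster analysis: group 200 synthetic points around two centers.
    points = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

    # Regression: fit a linear trend to noisy observations of y = 3x + 2.
    x = rng.uniform(0, 10, (100, 1))
    y = 3 * x.ravel() + 2 + rng.normal(0, 1, 100)
    model = LinearRegression().fit(x, y)

    print("cluster sizes:", np.bincount(labels))
    print("estimated slope and intercept:", model.coef_[0], model.intercept_)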
Chen et al. (2012) evoke the need for companies to invest in Business Intelligence and Analytics education that would be “interdisciplinary and cover critical analytical and IT skills, business and domain knowledge, and communication skills required in a complex data-centric business environment”. The investment in analytical knowledge should be accompanied by a cultural change spanning all employees, urging them to manage data properly and incorporate them into decision-making processes (Buhl et al. 2013). Mayer-Schönberger and Cukier (2013) envision the rise of new professional figures, called algorithmists, who would master the areas of computer science, mathematics and statistics and act as “impartial auditors to review the accuracy or validity of Big Data predictions”. Similarly, Davenport and Patil (2012) describe the data scientist as “a hybrid of data hacker, analyst, communicator, and trusted adviser”, who also has the fundamental abilities to write code and conduct, when needed, academic-style research. These skills are not sufficiently available to meet the increasing demand: according to Manyika et al. (2011), by the year 2018 there will be a potential shortfall of 1.5 million data-savvy managers and analysts in the US alone. The analysis of competency gaps, and the creation of effective teaching methods to fill them for both future and current managers and practitioners, is a promising research area that still has much room to grow.
The ability to make informed decisions is also changing with the expansion of Big Data, as the latter implies a shift from logical, causality-based reasoning to the acknowledgment of correlation links between events. The utilization of insights generated through Big Data Analytics in companies, universities and institutions calls for a new culture of decision making (McAfee & Brynjolfsson 2012) and an evolution of the scientific method (Anderson 2007), both of which are still to be built and provide opportunities for future research.
Being aware of the limitations of Big Data methods and of potential methodological issues is a fundamental resource for organizations that want to drive data-based decision making: for example, predictions should always be accompanied by valid confidence intervals in order to avoid the false sense of precision that the apparent sophistication of some Big Data applications can suggest. Analysts should also be capable of avoiding model overfitting, which would facilitate apophenia, i.e. the tendency of humans to see patterns where none actually exist, “simply because enormous quantities of data can offer connections that radiate in all directions” (Boyd & Crawford 2012).
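The risk of overfitting mentioned above is easy to demonstrate. In the sketch below, a high-degree polynomial fits pure random noise almost perfectly in-sample while having no predictive value whatsoever, a small-scale analogue of apophenia; the data sizes and degrees are arbitrary choices made for illustration.

    import numpy as np

    rng = np.random.default_rng(42)
    x = np.linspace(0, 1, 20)
    y = rng.normal(0, 1, 20)  # pure noise: there is no pattern to find

    for degree in (1, 15):
        coeffs = np.polyfit(x, y, degree)
        residuals = y - np.polyval(coeffs, x)
        rmse = np.sqrt(np.mean(residuals ** 2))
        print("degree", degree, "in-sample RMSE:", round(rmse, 3))

    # The degree-15 fit reports a near-zero in-sample error on noise,
    # "finding" structure where none exists; on fresh noise it would fail.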
In summary, Big Data requires the mastery of specific techniques, awareness of their strengths and limitations, and a widespread cultural tendency toward informed decision making that in most cases has yet to be built.
How Big Data Changes our Lives: Impact
The extent to which Big Data is impacting our society and our companies is often depicted through anecdotes
and success stories of methods and technology implementations. When these stories are accompanied by proposals
of new principles and methodological improvements they represent a valuable contribution to the creation of
knowledge on the subject. The pervasive nature of current information production and availability leads to many applications spanning numerous scientific fields and industry sectors that can be very distant from each other. Sometimes, the same techniques and data have been applied to solve problems in distant domains. For example, correlation analysis was leveraged to use logs of Google searches to forecast influenza epidemics (Ginsberg et al. 2009) as well as unemployment (Askitas & Zimmermann 2009) and inflation (Guzman 2011). The existing Big Data applications are many and expected to grow: hence, their systematic description constitutes a promising development area for those willing to contribute to scientific progress in this field.
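A minimal sketch of the correlation-based forecasting idea follows: a search-volume series is compared against a lagged target indicator, and the lag with the strongest Pearson correlation is identified. Both series are synthetic stand-ins invented for the example; the cited studies used actual query logs and official statistics.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 120

    # Synthetic indicator (e.g., weekly flu cases) and a search-volume
    # series that anticipates it by two periods, plus noise.
    indicator = np.sin(np.linspace(0, 6 * np.pi, n)) + rng.normal(0, 0.2, n)
    searches = np.roll(indicator, -2) + rng.normal(0, 0.2, n)

    for lag in range(5):
        r = np.corrcoef(searches[:n - lag], indicator[lag:])[0, 1]
        print("lag", lag, "correlation:", round(r, 2))

    # The correlation peaks at the lead built into the data, which is
    # what makes search volumes usable as an early warning signal.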
Big Data can also impact society adversely. There are multiple concerns arising from the quick advancement of Big Data (Boyd & Crawford 2012), the first being privacy. Although large data sets normally originate from actions performed by a multitude of individuals, the consequences of using that data may still affect a single individual in an invasive and/or unexpected way. The identifiability of the individual person can be reduced through a thorough anonymization of the data set, although full protection is hard to guarantee, as the reverse process of de-anonymization can potentially be attempted (Narayanan & Shmatikov 2008).
The predictability of future actions, made possible by the analysis of behavioral patterns, also poses the ethical issue of protecting free will in the future, on top of freedom in the present.
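The fragility of naive anonymization can be shown with a toy linkage attack, far simpler than, but in the spirit of, the techniques of Narayanan and Shmatikov (2008): records stripped of names are re-identified by joining on quasi-identifiers that also appear in a public auxiliary dataset. All records below are fabricated for illustration.

    # "Anonymized" release: names removed, quasi-identifiers kept.
    released = [
        {"zip": "00133", "birth_year": 1984, "sex": "F", "diagnosis": "flu"},
        {"zip": "03043", "birth_year": 1990, "sex": "M", "diagnosis": "asthma"},
    ]

    # Public auxiliary data (e.g., a voter roll) with identities attached.
    public = [
        {"name": "Alice Rossi", "zip": "00133", "birth_year": 1984, "sex": "F"},
        {"name": "Bruno Bianchi", "zip": "03043", "birth_year": 1990, "sex": "M"},
    ]

    QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

    def key(record):
        return tuple(record[k] for k in QUASI_IDENTIFIERS)

    identities = {key(p): p["name"] for p in public}
    for r in released:
        name = identities.get(key(r))
        if name:  # a unique quasi-identifier match re-identifies the record
            print(name, "->", r["diagnosis"])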
Other issues to be considered are related to the accessibility of information: the exclusive control over data
sources can become an abuse of dominant position and restrict competition by posing unfair entrance barriers to the
marketplace. For example, as Manovich (2011) notices, only social media companies have access to really large social data, especially transactional data, and they have full control over who can access what information. The
split between information-rich and data-lacking companies can create a new digital divide (Boyd & Crawford 2012)
that can slow down innovation in the sector. Specific policies will have to be promoted and data is likely to become
a new dimension to consider within antitrust regulations.
Not only society but also companies are heavily impacted by the rise of Big Data: the call to arms for acquiring vital skills and technology to be competitive in a data-driven market implies a serious reconsideration of firm organization and the full realm of business processes (Pearson & Wegener 2013). The transformation of data into
competitive advantage (McAfee & Brynjolfsson 2012) is what makes “Big Data” such an impactful revolution in
today’s business world.
FIGURE 2. Big Data key topics in existing research: a concept map connecting Big Data to the four themes (Information, Technology, Methods, Impact) through subtopics such as the Internet of Things, Datafication, Information Overload, Diverse and Unstructured data, Distributed Systems, Parallel Computing, Storage Capabilities, Programming Paradigms, Machine Learning, Visualization, Emerging Skills, Value Creation, Privacy, Decision Making, Applications, Organizations and Society.
A DEFINITION FOR BIG DATA
A convincing definition of a concept is an enabler of its scientific development. As Ronda-Pupo and Guerras-Martin (2012) suggest, the level of consensus shown by a scientific community on the definition of a concept can be used as a measure of the progress of a discipline. Big Data has instead evolved so quickly and in such a disorderly fashion that no universally accepted formal statement denoting its meaning exists. There have been many attempts to define Big Data, more or less popular in terms of utilization and citation. However, none of these proposals has prevented authors of Big Data-related works from extending, renovating or even ignoring previous definitions and proposing new ones. Although Big Data is still a relatively young concept, it certainly deserves an accepted vocabulary of reference that enables the proper development of the discipline among cognoscenti and practitioners.
In the first part of this paper we have identified the four main themes of Big Data and observed that they are the prevalent topics in the existing literature. In the next paragraphs we will review a non-exhaustive list of previously proposed Big Data definitions and conceptually tie them to the aforementioned four themes of research. After considering the existing definitions and analyzing their commonalities, we will propose a consensual definition of Big Data. Consensus, in this case, comes from acknowledging the centrality of some recurring attributes associated with Big Data, and from the assumption that they define the essence of what Big Data means to scholars and practitioners today. We expect such a definition to be less prone to attack from the authors and users of previous definitions, as it is based on the most central aspects associated with Big Data until now.
A thorough consensus analysis based on Cohen’s kappa coefficient (1960) and co-word analysis, as in (Ronda-Pupo & Guerras-Martin 2012), goes beyond the scope of this work and is left for future study.
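For reference, a sketch of the statistic such an analysis would rest on: Cohen’s kappa measures the agreement between two raters as κ = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the proportion of agreement expected by chance; κ = 1 indicates perfect agreement and κ = 0 agreement no better than chance.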
Survey of Existing Definitions
Big Data has often been described “implicitly” through success stories or anecdotes, characteristics, technological features, emerging trends or its impact on society, organizations and business processes. Among the existing attempts at an explicit definition of Big Data there is not even agreement on what kind of entity the term should be associated with. We have found that Big Data is used when referring to a variety of different entities including, but not limited to, social phenomena, information assets, data sets, analytical techniques, storage technologies, processes and infrastructures. We have surveyed multiple definitions that have been proposed to date and listed them in Table 1; in this paragraph we will go through the most notable ones.
A first group of Big Data definitions focuses on listing its characteristics. Probably the most popular definition falls within this group: when presenting the data management challenges that companies had to face in response to the rise of e-commerce in the early 2000s, Laney (2001) introduced a framework expressing the three-dimensional increase in data Volume, Velocity and Variety, and invoked the need for new formal practices implying tradeoffs and architectural solutions that involve and impact application portfolios and business strategy decisions. Although this work did not mention Big Data explicitly, the model, later nicknamed “the 3 V’s”, was associated with the concept of Big Data and used as its definition (Beyer & Laney 2012; Eaton et al. 2012; Zaslavsky et al. 2013). Many other authors extended the “3 V’s” model and, as a result, multiple features of Big Data, like Value (Dijcks 2012), Veracity (Schroeck et al. 2012), Complexity and Unstructuredness (Intel 2012; Suthaharan 2013), were added to the list.
A second group of definitions emphasizes the technological needs behind the processing of large amounts of data. According to Microsoft, Big Data is about applying “serious computing power” to massive sets of information (2013), and the National Institute of Standards and Technology (NIST) likewise highlights the need for “a scalable architecture for efficient storage, manipulation, and analysis” when defining Big Data (2014).
A few definitions associate Big Data with the crossing of some sort of threshold: for instance, Dumbill (2013) asserts that data is Big when it “exceeds the processing capacity of conventional database systems” and requires the choice of “an alternative way to process it”. Fisher et al. (2012) acknowledge that the size that constitutes “big” has grown according to Moore’s law, and link the absolute level of this threshold to the capacity of commercial storage solutions: Big Data is “so large as to not fit on a single hard drive” and, hence, “will be stored on several different disks”.
A last group of definitions highlights the impact of Big Data’s advancement on society. Boyd and Crawford (2012) notice that “Big Data is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets”. They define Big Data as “a cultural, technological, and scholarly phenomenon” that rests on the interplay of Technology (maximizing computation power and algorithmic accuracy), Analysis (identifying patterns on large data sets) and Mythology (meaning the belief that large data sets offer a higher form of intelligence, with an aura of truth, objectivity and accuracy). Mayer-Schönberger and Cukier (2013) describe Big Data by listing the “three key shifts in the way we analyze information that transform how we understand and organize society”: 1. “More data”, in terms of completeness of the data set, using all of the available data instead of a sample of it; 2. “More messy”, meaning that we can loosen our desire for exactitude and also use incomplete or less accurate input data; 3. “Correlation” becomes more important and overtakes “causality” as a way to make sense of trends and, finally, to make decisions.
TABLE 1. Existing definitions of Big Data, adapted from the referenced articles. The letters in parentheses after each definition indicate which of the four Big Data themes identified in the first section of the paper it alludes to: I - Information, T - Technology, M - Methods, P - Impact.

(Beyer & Laney 2012): “High volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (I, T, P)
(Dijcks 2012): “The four characteristics defining big data are Volume, Velocity, Variety and Value.” (I, P)
(Intel 2012): “Complex, unstructured, or large amounts of data.” (I)
(Suthaharan 2013): “Can be defined using three data characteristics: Cardinality, Continuity and Complexity.” (I)
(Schroeck et al. 2012): “Big data is a combination of Volume, Variety, Velocity and Veracity that creates an opportunity for organizations to gain competitive advantage in today’s digitized marketplace.” (I, P)
(NIST Big Data Public Working Group 2014): “Extensive datasets, primarily in the characteristics of volume, velocity and/or variety, that require a scalable architecture for efficient storage, manipulation, and analysis.” (I, T)
(Ward & Barker 2013): “The storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.” (I, T, M)
(Microsoft 2013): “The process of applying serious computing power, the latest in machine learning and artificial intelligence, to seriously massive and often highly complex sets of information.” (I, T, M)
(Dumbill 2013): “Data that exceeds the processing capacity of conventional database systems.” (I, T)
(Fisher et al. 2012): “Data that cannot be handled and processed in a straightforward manner.” (I, T)
(Shneiderman 2008): “A dataset that is too big to fit on a screen.” (I)
(Manyika et al. 2011): “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” (I, T, M)
(Chen et al. 2012): “The data sets and analytical techniques in applications that are so large and complex that they require advanced and unique data storage, management, analysis, and visualization technologies.” (I, T, M)
(Boyd & Crawford 2012): “A cultural, technological, and scholarly phenomenon that rests on the interplay of Technology, Analysis and Mythology.” (T, M, P)
(Mayer-Schönberger & Cukier 2013): “Phenomenon that brings three key shifts in the way we analyze information that transform how we understand and organize society: 1. More data, 2. Messier (incomplete) data, 3. Correlation overtakes causality.” (I, M, P)
Consensual Definition
By looking both at the existing definitions of Big Data and at the main research topics associated with it, we can affirm that the nucleus of the concept of Big Data can be expressed by:
- ‘Volume’, ‘Velocity’ and ‘Variety’, to describe the characteristics of the Information involved;
- specific ‘Technology’ and ‘Analytical Methods’, to clarify the unique requirements strictly needed to make use of such Information;
- the transformation into insights and the consequent creation of economic ‘Value’, as the principal way Big Data impacts companies and society.
We believe that the “object” to which Big Data should refer in its definition is ‘Information assets’, as this entity is clearly identifiable and not dependent on the field of application.
Therefore, we propose the following formal definition:
“Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to
require specific Technology and Analytical Methods for its transformation into Value.”
Such a definition of Big Data is compatible with the existence of terms like “Big Data Technology” and “Big
Data Methods” that should be used when referring directly to the specific technology and methods mentioned in the
main definition.
CONCLUSION
Big Data has recently become a voguish term among researchers and IT professionals. Its success is propelled by
a frequent utilization in a broad range of contexts and with several, and often incongruous, acceptations. As a result,
its meaning is still nebulous and this hinders an organized evolution of the subject.
We have conducted an analysis of the usage of this term in the literature and concluded that the top four themes associated with Big Data are: Information, Technology, Methods and Impact. We have then suggested a definition that is coherent with the current “as is” utilization of the term and consensual with the most prominent definitions proposed so far. We suggest using Big Data as a standalone term when referring to those “Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value”, and as an attribute when denoting its peculiar requisites, e.g. “Big Data Technology” or “Big Data Analytical Methods”. We believe that using this definition from now on will allow a more efficient scientific development of the matter.
Possible extensions to the present work include:
- A systematic literature review of “Big Data” by means of quantitative methods, such as co-word, cluster and frequency analysis. The review should also identify a more granular list of research topics through systematic methods like topic modeling.
- A study of how Big Data is systematically impacting the creation of economic value in companies, and a proposal of guidelines for a coherent development of systems and processes related to Business Intelligence and Analytics. We can presume that the value creation chain would go through the four themes of Big Data and that maximizing the value each component brings would generate higher returns on BI&A investments.
REFERENCES
Anderson, C., 2007. The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired, p.3.
Askitas, N. & Zimmermann, K.F., 2009. Google Econometrics and Unemployment Forecasting. Applied Economics Quarterly, 55(2), pp.107–120.
Atzori, L., Iera, A. & Morabito, G., 2010. The Internet of Things: A survey. Computer Networks, 54(15), pp.2787–2805.
Beyer, M.A. & Laney, D., 2012. The Importance of “Big Data”: A Definition. Gartner Publications, pp.1–9.
Boyd, D. & Crawford, K., 2012. Critical Questions for Big Data. Information, Communication & Society, 15(5), pp.662–679.
Buhl, H.U. et al., 2013. Big Data. Business & Information Systems Engineering, 5(2), pp.65–69.
Chen, H., Chiang, R. & Storey, V., 2012. Business Intelligence and Analytics: From Big Data to Big Impact. MIS Quarterly, 36(4), pp.1165–1188.
Chui, M., Löffler, M. & Roberts, R., 2010. The Internet of things. McKinsey Quarterly, 291(2), p.10.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, pp.37–46.
Coyle, K., 2006. Mass Digitization of Books. Journal of Academic Librarianship, 32(6), pp.641–645.
Davenport, T.H. & Patil, D.J., 2012. Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, 90(10), pp.70–76.
Dean, J. & Ghemawat, S., 2008. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), pp.1–13.
Dijcks, J., 2012. Oracle: Big data for the enterprise. Oracle White Paper, (June).
Dumbill, E., 2013. Making Sense of Big Data. Big Data.
Eaton, C. et al., 2012. Understanding Big Data, McGraw-Hill Companies.
Estrin, D. et al., 2002. Connecting the physical world with pervasive networks. IEEE Pervasive Computing, 1(1), pp.59–69.
Evans, D., 2011. The Internet of Things: How the Next Evolution of the Internet Is Changing Everything. CISCO white paper, (April), pp.1–11.
Fisher, D. et al., 2012. Interactions with Big Data Analytics. Interactions.
Gartner, 2014. Gartner Says the Internet of Things Will Transform the Data Center.
Ghemawat, S., Gobioff, H. & Leung, S.-T., 2003. The Google file system. ACM SIGOPS Operating Systems Review, 37(5), p.29.
Ginsberg, J. et al., 2009. Detecting influenza epidemics using search engine query data. Nature, 457(7232), pp.1012–1014.
Guzman, G., 2011. Internet search behavior as an economic forecasting tool: The case of inflation expectations. Journal of Economic and Social Measurement, 36(3), pp.119–167.
Hilbert, M. & López, P., 2011. The world’s technological capacity to store, communicate, and compute information. Science, 332(6025), pp.60–65.
Intel, 2012. Big Data Analytics. Intel’s IT Manager Survey on How Organizations Are Using Big Data.
Laney, D., 2001. 3D data management: Controlling data volume, velocity and variety. META Group Research Note, 6 (February 2001).
Manovich, L., 2011. Trending: The Promises and the Challenges of Big Social Data. Debates in the Digital Humanities, pp.1–10.
Manyika, J. et al., 2011. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
Mayer-Schönberger, V. & Cukier, K., 2013. Big Data: A Revolution That Will Transform How We Live, Work and Think, London: John Murray.
McAfee, A. & Brynjolfsson, E., 2012. Big data: the management revolution. Harvard Business Review, (October 2012).
Michel, J.-B. et al., 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014), pp.176–182.
Microsoft, 2013. The Big Bang: How the Big Data Explosion Is Changing the World.
Moore, G.E., 2006. Cramming more components onto integrated circuits. Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp.114 ff. IEEE Solid-State Circuits Newsletter, 20(3).
Narayanan, A. & Shmatikov, V., 2008. Robust de-anonymization of large sparse datasets. In Proceedings - IEEE Symposium on Security and Privacy, pp. 111–125.
NIST Big Data Public Working Group, 2014. Big Data Interoperability Framework: Definitions (draft).
Pearson, T. & Wegener, R., 2013. Big Data: The organizational challenge.
Ronda-Pupo, G.A. & Guerras-Martin, L.Á., 2012. Dynamics of the evolution of the strategy concept 1962–2008: a co-word analysis. Strategic Management Journal, 33(2), pp.162–188.
Russom, P., 2011. Big data analytics. TDWI Best Practices Report, Fourth Quarter.
Schroeck, M. et al., 2012. Analytics: The real-world use of big data.
Shneiderman, B., 2008. Extreme visualization: squeezing a billion records into a million pixels. In International Conference on Management of Data, pp.3–12.
Shvachko, K. et al., 2010. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST 2010).
Suthaharan, S., 2013. Big Data Classification: Problems and challenges in network intrusion prediction with machine learning. Big Data Analytics Workshop.
Viegas, F.B. et al., 2007. Many Eyes: A site for visualization at internet scale. IEEE Transactions on Visualization and Computer Graphics, 13(6), pp.1121–1128.
Ward, J. & Barker, A., 2013. Undefined By Data: A Survey of Big Data Definitions. arXiv preprint arXiv:1309.5821.
Xiong, W. et al., 2013. A characterization of big data benchmarks. In 2013 IEEE International Conference on Big Data, pp. 118–125.
Zaslavsky, A., Perera, C. & Georgakopoulos, D., 2013. Sensing as a service and big data. arXiv preprint.