Preprint copy
Presented at 4th International Conference on Integrated Information.
Accepted for publication in AIP Proceedings, exp. end 2014.
What is Big Data? A Consensual Definition and a Review of
Key Research Topics
Andrea De Mauro1, a), Marco Greco2, b) and Michele Grimaldi2, c)
1Department of Enterprise Engineering, University of Rome Tor Vergata, Via del Politecnico 1, 00133 Roma, Italy
2Department of Civil and Mechanical Engineering, University of Cassino and Southern Lazio, Via Di Biasio 43,
03043 Cassino (FR), Italy
a)Corresponding author: andrea.de.mauro@uniroma2.it
b)m.greco@unicas.it
c)m.grimaldi@unicas.it
Abstract. Although Big Data is a trending buzzword in both academia and the industry, its meaning is still shrouded by
much conceptual vagueness. The term is used to describe a wide range of concepts: from the technological ability to
store, aggregate, and process data, to the cultural shift that is pervasively invading business and society, both drowning in
information overload. The lack of a formal definition has led research to evolve into multiple and inconsistent paths.
Furthermore, the existing ambiguity among researchers and practitioners undermines an efficient development of the
subject. In this paper we have reviewed the existing literature on Big Data and analyzed its previous definitions in order
to pursue two results: first, to provide a summary of the key research areas related to the phenomenon, identifying
emerging trends and suggesting opportunities for future development; second, to provide a consensual definition for Big
Data, by synthesizing common themes of existing works and patterns in previous definitions.
Keywords: Big Data; Analytics; Information Management; Data Processing; Business Intelligence.
INTRODUCTION
Big Data¹ has now become a ubiquitous term in many parts of industry and academia. As often happens in these
cases, the frequent utilization of the same words in different contexts poses a threat towards the structured evolution
of its meaning. For this reason it is necessary to invest time and effort in the proposition and the acceptance of a
standard definition of Big Data that would pave the way to its systemic evolution and minimize the confusion
related to its usage. In order to describe Big Data we have decided to start from an “as is” analysis of the contexts in
which the term most frequently appears. Given its remarkable success and its hectic evolution, Big Data possesses
multiple and diverse nuances of meaning, all of which have the right to exist. By analyzing the most significant
occurrences of this term in both academic and business literature we have identified four key themes to which Big
Data refers: Information, Technologies, Methods and Impact. We can reasonably assert that the vast majority of
references to Big Data encompass one of the four themes listed above. Understanding how these themes have been
dealt with in existing literature and how they are mutually interconnected is the objective of the first section of this
paper and is propaedeutic to the attempt of proposing a thorough definition, which is what the second section aims
to provide. We believe that having such a definition will enable a more conscious usage of the term Big Data and a
more coherent development of research on this subject.
¹ We have chosen to capitalize the term ‘Big Data’ throughout this article to clarify that it is the specific subject we are discussing.
REVIEW OF MAIN RESEARCH TOPICS
This section offers a broad, though non-exhaustive, review of research topics in the area of Big Data. We have examined a large number of abstracts of peer-reviewed conference and journal papers and identified recurring topics by looking at the appearance frequency of top keywords and making an educated guess about their interrelations. This heuristic approach was needed to depict the ample range of concepts related to Big Data while using a relatively small number of topic categories. A systematic literature
review is beyond the scope of this paper and left as an opportunity for future work. The input list of documents was
obtained from Elsevier’s Scopus, a citation database containing more than 50 million records from around 5,000
publishers. On the 3rd of May 2014 we exported a list of 1,581 conference papers and articles that contained the full term “Big Data” in either the title or the author-provided keywords². We have removed those entries where the abstract text was not available; this left us with a corpus of 1,437 documents. By counting the appearance
frequency of words included in the abstracts we have identified the most recurring items. Figure 1 shows a static tag
cloud visualization (also known as “word cloud”) of the most popular words in the abstracts we analyzed, obtained
through the online tool ManyEyes (Viegas et al. 2007).
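For readers wishing to reproduce this kind of keyword frequency analysis, the sketch below shows one possible approach in Python. It is a minimal illustration only: the file name scopus_export.csv, the abstract column name and the stop-word list are assumptions made for the example, not artifacts of the original study.

    import csv
    import re
    from collections import Counter

    # Illustrative stop words; a real analysis would use a fuller list.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "for",
                  "is", "are", "on", "with", "this", "that", "we", "by"}

    def top_keywords(path, column="abstract", n=25):
        """Count the most frequent words across all abstracts in a CSV export."""
        counts = Counter()
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                words = re.findall(r"[a-z]+", row[column].lower())
                counts.update(w for w in words if w not in STOP_WORDS)
        return counts.most_common(n)

    if __name__ == "__main__":
        for word, freq in top_keywords("scopus_export.csv"):
            print(word, freq)

The resulting frequency list is exactly the kind of input that a tag cloud tool such as ManyEyes renders, with font size proportional to word frequency.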
By analyzing the most frequent keywords included in Big Data-related abstracts and considering their mutual
relationships we have identified four top research themes in current literature, namely: 1. Information, 2.
Technology, 3. Methods, 4. Impact. We believe that the great majority of papers written on Big Data touch upon one
or more of these four topics. For each of them we will now describe content, trends and enlist a number of relevant
works.
FIGURE 1. Static tag cloud visualization (word cloud) of key terms appearing in abstracts of Big Data-related papers.
The Fuel of Big Data: Information
One of the fundamental reasons for the Big Data phenomenon to exist is the current extent to which information can
be generated and made available. Digitization, i.e. the process of converting continuous, analog information into
discrete, digital and machine-readable format, reached broad popularity with the first “mass digitization” projects.
Mass digitization is the attempt to convert entire printed book libraries into digital collections by leveraging optical
character recognition (OCR) software in order to minimize human intervention (Coyle 2006). One of the most
popular attempts at mass digitization was the Google Print Library Project³, started in 2004, which aimed at digitizing more than 15 million volumes held in multiple university libraries, including Harvard, Stanford and Oxford. More
² We have used the following search query in Scopus: AUTHKEY("Big data") OR TITLE("big data") AND (LIMIT-TO(DOCTYPE, "cp") OR LIMIT-TO(DOCTYPE, "ar") OR LIMIT-TO(DOCTYPE, "ip")).
³ For more information you can visit the Google Books History page, available at http://www.google.com/googlebooks/about/history.html.
recently, a subtle differentiation has been proposed between digitization and its next step, datafication, i.e. putting a phenomenon in a quantified format so that it can be tabulated and analyzed (Mayer-Schönberger & Cukier 2013). The fundamental difference is that digitization enables analog information to be transferred and stored in a more convenient digital format, while datafication aims at organizing the digitized versions of analog signals in order to generate insights that could not have been inferred while the signals were in their original form. In the case of the previously cited Google mass digitization effort, the value of datafication emerged when researchers showed they were able to provide insights on lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology by using Google Books’ data (Michel et al. 2011).
Digitization and datafication have become pervasive phenomena thanks to the broad availability of devices that
are both connected and provided with digital sensors. Digital sensors enable digitization while connection lets data
be aggregated and, thus, permits datafication. Cisco estimated that between 2008 and 2009 the number of connected
devices overtook the number of living people (Evans 2011) and, according to Gartner (2014), by 2020 there will be 26 billion devices on earth, more than 3 devices on average per person. The pervasive presence of a variety of
objects (including mobile phones, sensors, Radio-Frequency Identification - RFID - tags, actuators), which are able
to interact with each other and cooperate with their neighbors to reach common goals, goes under the name of the
Internet of Things, IoT (Estrin et al. 2002; Atzori et al. 2010). This increasing availability of sensor-enabled,
connected devices is equipping companies with extensive information assets from which it is possible to create new
business models, improve business processes and reduce costs and risks (Chui et al. 2010). In other words, IoT is
one of the most promising fuels of Big Data expansion.
Another characteristic of the data generated today is its increasing variety in type. Structured data (traditional
text/numeric information) is now joined by unstructured data (audio, video, images, text and human language) and
semistructured data, such as XML and RSS feeds (Russom 2011). The diversity of data types is one of the
challenges that organizations need to tackle in order to make value out of the extensive informational assets
available today (Manyika et al. 2011).
Equipment for Working with Big Data: Technology
The term Big Data is frequently associated with the specific technology that enables its utilization. The sheer size of the datasets and the complexity of the operations needed to process them entail stringent memory, storage and computational performance requirements. According to Google Trends, the query most related to “Big Data” is “Hadoop”, which is indeed the most prominent technology associated with this topic. Hadoop is an open-source framework that enables the distributed processing of big quantities of data by using a group of dispersed machines and specific computer programming models. The main components of Hadoop are: 1. its file system, HDFS, which allows access to data scattered over multiple machines without having to cope with the complexity inherent in their dispersed nature; 2. MapReduce, a programming model designed to implement distributed and parallel algorithms in an efficient way. Both HDFS (Shvachko et al. 2010) and MapReduce (Dean & Ghemawat 2008) are evolutions of concepts originally proposed by Google (Ghemawat et al. 2003) that were subsequently developed as open-source projects within the Apache framework, which underscores Google’s central role in initiating current thinking about Big Data. The Hadoop framework contains multiple modules and libraries compatible with HDFS and MapReduce that extend its applicability to the various needs of coordination, analysis, performance management and workflow design that normally occur in Big Data applications.
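As a minimal illustration of the MapReduce model just described, the following Python sketch simulates the map, shuffle and reduce phases of the canonical word-count example on a single machine. A real Hadoop job would distribute these same phases across many nodes; the sketch only mirrors the logical structure, under assumptions chosen for brevity.

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # Map: emit a (key, value) pair for every word occurrence.
        return [(word, 1) for word in document.split()]

    def shuffle(pairs):
        # Shuffle: group values by key, as the framework does between phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce: aggregate all the values emitted for a given key.
        return key, sum(values)

    documents = ["big data needs big ideas", "data beats opinions"]
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    results = [reduce_phase(k, v) for k, v in shuffle(pairs).items()]
    print(sorted(results))

Because each map call and each reduce call is independent, the framework can run them in parallel on different machines, which is what makes the model suitable for Big Data workloads.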
The distributed nature of information requires a specific technological effort for transmitting big quantities of
data and for monitoring the overall system performance using special benchmarking techniques (Xiong et al. 2013).
Another fundamental technological element is the ability to store a greater quantity of data on smaller physical devices. Although Moore’s law (2006) suggests that storage capacity increases exponentially over time, a continuous and expensive research and development effort is still required to keep up with the pace at which data size grows (Hilbert & López 2011), especially given the growing share of byte-hungry data types such as images, sounds and videos.
Transforming Big Data into Value: Methods
The analysis of extensive quantities of data and the need to extract value from individual behaviors require processing methods that go beyond traditional statistical techniques. Knowledge of such methods, of their potential and, above all, of their limitations requires specific skills that are hard to find in today’s job market.
Both Manyika et al. (2011) and Chen et al. (2012) propose a list of Big Data analytical methods, which includes (in alphabetical order): A/B testing, Association rule learning, Classification, Cluster analysis, Data fusion and data integration, Ensemble learning, Genetic algorithms, Machine learning, Natural Language Processing, Neural networks, Network analysis, Pattern recognition, Predictive modelling, Regression, Sentiment Analysis, Signal Processing, Spatial analysis, Statistics, Supervised and Unsupervised learning, Simulation, Time series analysis and Visualization.
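To give a concrete flavor of the tooling behind two of the listed methods, cluster analysis and regression, the sketch below applies them to synthetic data with scikit-learn. This is an illustrative assumption about how such methods are typically invoked, not part of the original lists.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Cluster analysis: group 200 synthetic points around two centers.
    points = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

    # Regression: fit a linear trend to noisy observations of y = 3x + 2.
    x = rng.uniform(0, 10, (100, 1))
    y = 3 * x.ravel() + 2 + rng.normal(0, 1, 100)
    model = LinearRegression().fit(x, y)

    print("cluster sizes:", np.bincount(labels))
    print("estimated slope and intercept:", model.coef_[0], model.intercept_)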
Chen et al. (2012) evoke the need for companies to invest in Business Intelligence and Analytics education that would be “interdisciplinary and cover critical analytical and IT skills, business and domain knowledge, and communication skills required in a complex data-centric business environment”. The investment in analytical knowledge should be accompanied by a cultural change spanning all employees, urging them to manage data properly and incorporate them into decision-making processes (Buhl et al. 2013). Mayer-Schönberger and Cukier (2013) envision the rise of new professional figures, called algorithmists, who would master the areas of computer science, mathematics and statistics and act as “impartial auditors to review the accuracy or validity of Big Data predictions”. Similarly, Davenport and Patil (2012) describe the data scientist as “a hybrid of data hacker, analyst, communicator, and trusted adviser”, who also has the fundamental abilities to write code and conduct, when needed, academic-style research. These skills are not sufficiently available to meet the increasing demand: according to Manyika et al. (2011), by the year 2018 there will be a potential shortfall of 1.5 million data-savvy managers and analysts in the US alone. The analysis of competency gaps, and the creation of effective teaching methods to fill them for both future and current managers and practitioners, is a promising research area that still has much room to grow.
The ability to make informed decisions is also changing with the expansion of Big Data, as the latter implies a shift from logical, causality-based reasoning to the acknowledgment of correlation links between events. The utilization of insights generated through Big Data Analytics in companies, universities and institutions calls for a new culture of decision making (McAfee & Brynjolfsson 2012) and an evolution of the scientific method (Anderson 2007), both of which are still to be built and provide opportunities for future research.
Being aware of the limitations of Big Data methods and of potential methodological issues is a fundamental resource for organizations that want to drive data-based decision making: for example, predictions should always be accompanied by valid confidence intervals in order to avoid the false sense of precision that the apparent sophistication of some Big Data applications can suggest. Analysts should also be capable of avoiding model overfitting, which would facilitate apophenia, i.e. the tendency of humans to see patterns where none actually exist, “simply because enormous quantities of data can offer connections that radiate in all directions” (Boyd & Crawford 2012).
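The risk of overfitting mentioned above is easy to demonstrate. In the sketch below, a high-degree polynomial fits pure random noise almost perfectly in-sample while having no predictive value whatsoever, a small-scale analogue of apophenia; the data sizes and degrees are arbitrary choices made for illustration.

    import numpy as np

    rng = np.random.default_rng(42)
    x = np.linspace(0, 1, 20)
    y = rng.normal(0, 1, 20)  # pure noise: there is no pattern to find

    for degree in (1, 15):
        coeffs = np.polyfit(x, y, degree)
        residuals = y - np.polyval(coeffs, x)
        rmse = np.sqrt(np.mean(residuals ** 2))
        print("degree", degree, "in-sample RMSE:", round(rmse, 3))

    # The degree-15 fit reports a near-zero in-sample error on noise,
    # "finding" structure where none exists; on fresh noise it would fail.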
In summary, Big Data requires the mastery of specific techniques, awareness of their strengths and limitations, and a widespread cultural tendency toward informed decision making that in most cases has yet to be built.
How Big Data Changes our Lives: Impact
The extent to which Big Data is impacting our society and our companies is often depicted through anecdotes
and success stories of methods and technology implementations. When these stories are accompanied by proposals
of new principles and methodological improvements they represent a valuable contribution to the creation of
knowledge on the subject. The pervasive nature of current information production and availability leads to many applications spanning numerous scientific fields and industry sectors that can be very distant from each other. Sometimes, the same techniques and data have been applied to solve problems in distant domains. For example, correlation analysis was leveraged to use logs of Google searches to forecast influenza epidemics (Ginsberg et al. 2009) as well as unemployment (Askitas & Zimmermann 2009) and inflation (Guzman 2011). The existing Big Data applications are many and expected to grow: hence, their systematic description constitutes a promising development area for those willing to contribute to scientific progress in this field.
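A minimal sketch of the correlation-based forecasting idea follows: a search-volume series is compared against a lagged target indicator, and the lag with the strongest Pearson correlation is identified. Both series are synthetic stand-ins invented for the example; the cited studies used actual query logs and official statistics.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 120

    # Synthetic indicator (e.g., weekly flu cases) and a search-volume
    # series that anticipates it by two periods, plus noise.
    indicator = np.sin(np.linspace(0, 6 * np.pi, n)) + rng.normal(0, 0.2, n)
    searches = np.roll(indicator, -2) + rng.normal(0, 0.2, n)

    for lag in range(5):
        r = np.corrcoef(searches[:n - lag], indicator[lag:])[0, 1]
        print("lag", lag, "correlation:", round(r, 2))

    # The correlation peaks at the lead built into the data, which is
    # what makes search volumes usable as an early warning signal.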
Big Data can also impact society adversely. There are multiple concerns arising from the quick advancement of Big Data (Boyd & Crawford 2012), the first being privacy. Although large data sets normally originate from actions performed by a multitude of individuals, the consequences of using that data may still affect a single individual in an invasive and/or unexpected way. The identifiability of the individual person can be reduced through a thorough anonymization of the data set, although full protection is hard to guarantee, as the reverse process of de-anonymization can potentially be attempted (Narayanan & Shmatikov 2008).
The predictability of future actions, made possible by the analysis of behavioral patterns, also poses the ethical issue of protecting free will in the future, on top of freedom in the present.
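The fragility of naive anonymization can be shown with a toy linkage attack, far simpler than, but in the spirit of, the techniques of Narayanan and Shmatikov (2008): records stripped of names are re-identified by joining on quasi-identifiers that also appear in a public auxiliary dataset. All records below are fabricated for illustration.

    # "Anonymized" release: names removed, quasi-identifiers kept.
    released = [
        {"zip": "00133", "birth_year": 1984, "sex": "F", "diagnosis": "flu"},
        {"zip": "03043", "birth_year": 1990, "sex": "M", "diagnosis": "asthma"},
    ]

    # Public auxiliary data (e.g., a voter roll) with identities attached.
    public = [
        {"name": "Alice Rossi", "zip": "00133", "birth_year": 1984, "sex": "F"},
        {"name": "Bruno Bianchi", "zip": "03043", "birth_year": 1990, "sex": "M"},
    ]

    QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

    def key(record):
        return tuple(record[k] for k in QUASI_IDENTIFIERS)

    identities = {key(p): p["name"] for p in public}
    for r in released:
        name = identities.get(key(r))
        if name:  # a unique quasi-identifier match re-identifies the record
            print(name, "->", r["diagnosis"])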
Other issues to be considered are related to the accessibility of information: the exclusive control over data
sources can become an abuse of dominant position and restrict competition by posing unfair entrance barriers to the
marketplace. For example, as Manovich (2011) notices, only social media companies have access to really large social data, especially transactional data, and they have full control over who can access what information. The
split between information-rich and data-lacking companies can create a new digital divide (Boyd & Crawford 2012)
that can slow down innovation in the sector. Specific policies will have to be promoted and data is likely to become
a new dimension to consider within antitrust regulations.
Not only society but also companies are heavily impacted by the rise of Big Data: the call to arms for acquiring vital skills and technology to be competitive in a data-driven market implies a serious reconsideration of firm organization and the full realm of business processes (Pearson & Wegener 2013). The transformation of data into
competitive advantage (McAfee & Brynjolfsson 2012) is what makes “Big Data” such an impactful revolution in
today’s business world.
FIGURE 2. Big Data key topics in existing research: a concept map connecting Big Data to the four themes (Information, Technology, Methods, Impact) through subtopics such as the Internet of Things, Datafication, Information Overload, Diverse and Unstructured data, Distributed Systems, Parallel Computing, Storage Capabilities, Programming Paradigms, Machine Learning, Visualization, Emerging Skills, Value Creation, Privacy, Decision Making, Applications, Organizations and Society.
A DEFINITION FOR BIG DATA
A convincing definition of a concept is an enabler of its scientific development. As Ronda-Pupo and Guerras-Martin (2012) suggest, the level of consensus shown by a scientific community on the definition of a concept can be used as a measure of the progress of a discipline. Big Data has instead evolved so quickly and in such a disorderly fashion that no universally accepted formal statement denoting its meaning exists. There have been many attempts to define Big Data, more or less popular in terms of utilization and citation. However, none of these proposals has prevented authors of Big Data-related works from extending, renovating or even ignoring previous definitions and proposing new ones. Although Big Data is still a relatively young concept, it certainly deserves an accepted vocabulary of reference that enables the proper development of the discipline among cognoscenti and practitioners.
In the first part of this paper we have identified the four main themes of Big Data and observed that they are the prevalent topics in the existing literature. In the next paragraphs we will review a non-exhaustive list of previously proposed Big Data definitions and conceptually tie them to the aforementioned four themes of research. After considering the existing definitions and analyzing their commonalities, we will propose a consensual definition of Big Data. Consensus, in this case, comes from acknowledging the centrality of some recurring attributes associated with Big Data, and from the assumption that they define the essence of what Big Data means to scholars and practitioners today. We expect such a definition to be less prone to attack from the authors and users of previous definitions, as it is based on the most central aspects associated with Big Data until now.
A thorough consensus analysis based on Cohen’s kappa coefficient (1960) and co-word analysis, as in (Ronda-Pupo & Guerras-Martin 2012), goes beyond the scope of this work and is left for future study.
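For reference, a sketch of the statistic such an analysis would rest on: Cohen’s kappa measures the agreement between two raters as κ = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the proportion of agreement expected by chance; κ = 1 indicates perfect agreement and κ = 0 agreement no better than chance.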
Survey of Existing Definitions
Big Data has often been described “implicitly” through success stories or anecdotes, characteristics, technological features, emerging trends or its impact on society, organizations and business processes. Among the existing attempts at an explicit definition of Big Data there is not even agreement on what kind of entity the term should be associated with. We have found that Big Data is used when referring to a variety of different entities including, but not limited to, social phenomena, information assets, data sets, analytical techniques, storage technologies, processes and infrastructures. We have surveyed multiple definitions that have been proposed to date and listed them in Table 1; in this paragraph we will go through the most notable ones.
A first group of Big Data definitions focuses on listing its characteristics. Probably the most popular definition falls within this group: when presenting the data management challenges that companies had to face in response to the rise of e-commerce in the early 2000s, Laney (2001) introduced a framework expressing the three-dimensional increase in data Volume, Velocity and Variety, and invoked the need for new formal practices implying tradeoffs and architectural solutions that involve and impact application portfolios and business strategy decisions. Although this work did not mention Big Data explicitly, the model, later nicknamed “the 3 V’s”, was associated with the concept of Big Data and used as its definition (Beyer & Laney 2012; Eaton et al. 2012; Zaslavsky et al. 2013). Many other authors extended the “3 V’s” model and, as a result, multiple features of Big Data, like Value (Dijcks 2012), Veracity (Schroeck et al. 2012), Complexity and Unstructuredness (Intel 2012; Suthaharan 2013), were added to the list.
A second group of definitions emphasizes the technological needs behind the processing of large amounts of data. According to Microsoft, Big Data is about applying “serious computing power” to massive sets of information (2013), and the National Institute of Standards and Technology (NIST) likewise highlights the need for “a scalable architecture for efficient storage, manipulation, and analysis” when defining Big Data (2014).
A few definitions associate Big Data with the crossing of some sort of threshold: for instance, Dumbill (2013) asserts that data is Big when it “exceeds the processing capacity of conventional database systems” and requires the choice of “an alternative way to process it”. Fisher et al. (2012) acknowledge that the size that constitutes “big” has grown according to Moore’s law, and link the absolute level of this threshold to the capacity of commercial storage solutions: Big Data is “so large as to not fit on a single hard drive” and, hence, “will be stored on several different disks”.
A last group of definitions highlights the impact of Big Data’s advancement on society. Boyd and Crawford (2012) notice that “Big Data is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets”. They define Big Data as “a cultural, technological, and scholarly phenomenon” that rests on the interplay of Technology (maximizing computation power and algorithmic accuracy), Analysis (identifying patterns on large data sets) and Mythology (meaning the belief that large data sets offer a higher form of intelligence, with an aura of truth, objectivity and accuracy). Mayer-Schönberger and Cukier (2013) describe Big Data by listing the “three key shifts in the way we analyze information that transform how we understand and organize society”: 1. “More data”, in terms of completeness of the data set, using all of the available data instead of a sample of it; 2. “More messy”, meaning that we can loosen our desire for exactitude and also use incomplete or less accurate input data; 3. “Correlation” becomes more important and overtakes “causality” as a way to make sense of trends and, finally, to make decisions.
TABLE 1. Existing definitions of Big Data, adapted from the referenced articles. The letters in parentheses after each definition indicate which of the four Big Data themes identified in the first section of the paper it alludes to: I - Information, T - Technology, M - Methods, P - Impact.

(Beyer & Laney 2012): “High volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (I, T, P)
(Dijcks 2012): “The four characteristics defining big data are Volume, Velocity, Variety and Value.” (I, P)
(Intel 2012): “Complex, unstructured, or large amounts of data.” (I)
(Suthaharan 2013): “Can be defined using three data characteristics: Cardinality, Continuity and Complexity.” (I)
(Schroeck et al. 2012): “Big data is a combination of Volume, Variety, Velocity and Veracity that creates an opportunity for organizations to gain competitive advantage in today’s digitized marketplace.” (I, P)
(NIST Big Data Public Working Group 2014): “Extensive datasets, primarily in the characteristics of volume, velocity and/or variety, that require a scalable architecture for efficient storage, manipulation, and analysis.” (I, T)
(Ward & Barker 2013): “The storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.” (I, T, M)
(Microsoft 2013): “The process of applying serious computing power, the latest in machine learning and artificial intelligence, to seriously massive and often highly complex sets of information.” (I, T, M)
(Dumbill 2013): “Data that exceeds the processing capacity of conventional database systems.” (I, T)
(Fisher et al. 2012): “Data that cannot be handled and processed in a straightforward manner.” (I, T)
(Shneiderman 2008): “A dataset that is too big to fit on a screen.” (I)
(Manyika et al. 2011): “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” (I, T, M)
(Chen et al. 2012): “The data sets and analytical techniques in applications that are so large and complex that they require advanced and unique data storage, management, analysis, and visualization technologies.” (I, T, M)
(Boyd & Crawford 2012): “A cultural, technological, and scholarly phenomenon that rests on the interplay of Technology, Analysis and Mythology.” (T, M, P)
(Mayer-Schönberger & Cukier 2013): “Phenomenon that brings three key shifts in the way we analyze information that transform how we understand and organize society: 1. More data, 2. Messier (incomplete) data, 3. Correlation overtakes causality.” (I, M, P)
Consensual Definition
By looking both at the existing definitions of Big Data and at the main research topics associated with it, we can affirm that the nucleus of the concept of Big Data can be expressed by:
- ‘Volume’, ‘Velocity’ and ‘Variety’, to describe the characteristics of the Information involved;
- specific ‘Technology’ and ‘Analytical Methods’, to clarify the unique requirements strictly needed to make use of such Information;
- the transformation into insights and the consequent creation of economic ‘Value’, as the principal way Big Data impacts companies and society.
We believe that the “object” to which Big Data should refer in its definition is ‘Information assets’, as this entity is clearly identifiable and not dependent on the field of application.
Therefore, we propose the following formal definition:
“Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to
require specific Technology and Analytical Methods for its transformation into Value.”
Such a definition of Big Data is compatible with the existence of terms like “Big Data Technology” and “Big
Data Methods” that should be used when referring directly to the specific technology and methods mentioned in the
main definition.
CONCLUSION
Big Data has recently become a voguish term among researchers and IT professionals. Its success is propelled by
a frequent utilization in a broad range of contexts and with several, and often incongruous, acceptations. As a result,
its meaning is still nebulous and this hinders an organized evolution of the subject.
We have conducted an analysis of the usage of this term in the literature and concluded that the top four themes associated with Big Data are: Information, Technology, Methods and Impact. We have then suggested a definition that is coherent with the current “as is” utilization of the term and consensual with the most prominent definitions proposed so far. We suggest using Big Data as a standalone term when referring to those “Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value”, and as an attribute when denoting its peculiar requisites, e.g. “Big Data Technology” or “Big Data Analytical Methods”. We believe that using this definition from now on will allow a more efficient scientific development of the matter.
Possible extensions to the present work include:
- A systematic literature review of “Big Data” by means of quantitative methods, such as co-word, cluster and frequency analysis. The review should also identify a more granular list of research topics through systematic methods like topic modeling.
- A study of how Big Data is systematically impacting the creation of economic value in companies, and a proposal of guidelines for a coherent development of systems and processes related to Business Intelligence and Analytics. We can presume that the value creation chain would go through the four themes of Big Data and that maximizing the value each component brings would generate higher returns on BI&A investments.
REFERENCES
Anderson, C., 2007. The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired, p.3.
Askitas, N. & Zimmermann, K.F., 2009. Google Econometrics and Unemployment Forecasting. Applied Economics Quarterly, 55(2), pp.107–120.
Atzori, L., Iera, A. & Morabito, G., 2010. The Internet of Things: A survey. Computer Networks, 54(15), pp.2787–2805.
Beyer, M.A. & Laney, D., 2012. The Importance of “Big Data”: A Definition. Gartner Publications, pp.1–9.
Boyd, D. & Crawford, K., 2012. Critical Questions for Big Data. Information, Communication & Society, 15(5), pp.662–679.
Buhl, H.U. et al., 2013. Big Data. Business & Information Systems Engineering, 5(2), pp.65–69.
Chen, H., Chiang, R. & Storey, V., 2012. Business Intelligence and Analytics: From Big Data to Big Impact. MIS Quarterly, 36(4), pp.1165–1188.
Chui, M., Löffler, M. & Roberts, R., 2010. The Internet of things. McKinsey Quarterly, 291(2), p.10.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, pp.37–46.
Coyle, K., 2006. Mass Digitization of Books. Journal of Academic Librarianship, 32(6), pp.641–645.
Davenport, T.H. & Patil, D.J., 2012. Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, 90(10), pp.70–76.
Dean, J. & Ghemawat, S., 2008. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), pp.1–13.
Dijcks, J., 2012. Oracle: Big data for the enterprise. Oracle White Paper, (June).
Dumbill, E., 2013. Making Sense of Big Data. Big Data.
Eaton, C. et al., 2012. Understanding Big Data, McGraw-Hill Companies.
Estrin, D. et al., 2002. Connecting the physical world with pervasive networks. IEEE Pervasive Computing, 1(1), pp.59–69.
Evans, D., 2011. The Internet of Things: How the Next Evolution of the Internet Is Changing Everything. CISCO white paper, (April), pp.1–11.
Fisher, D. et al., 2012. Interactions with Big Data Analytics. Interactions.
Gartner, 2014. Gartner Says the Internet of Things Will Transform the Data Center.
Ghemawat, S., Gobioff, H. & Leung, S.-T., 2003. The Google file system. ACM SIGOPS Operating Systems Review, 37(5), p.29.
Ginsberg, J. et al., 2009. Detecting influenza epidemics using search engine query data. Nature, 457(7232), pp.1012–1014.
Guzman, G., 2011. Internet search behavior as an economic forecasting tool: The case of inflation expectations. Journal of Economic and Social Measurement, 36(3), pp.119–167.
Hilbert, M. & López, P., 2011. The world’s technological capacity to store, communicate, and compute information. Science, 332(6025), pp.60–65.
Intel, 2012. Big Data Analytics. Intel’s IT Manager Survey on How Organizations Are Using Big Data.
Laney, D., 2001. 3D data management: Controlling data volume, velocity and variety. META Group Research Note, 6 (February 2001).
Manovich, L., 2011. Trending: The Promises and the Challenges of Big Social Data. Debates in the Digital Humanities, pp.1–10.
Manyika, J. et al., 2011. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
Mayer-Schönberger, V. & Cukier, K., 2013. Big Data: A Revolution That Will Transform How We Live, Work and Think, London: John Murray.
McAfee, A. & Brynjolfsson, E., 2012. Big data: the management revolution. Harvard Business Review, (October 2012).
Michel, J.-B. et al., 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014), pp.176–182.
Microsoft, 2013. The Big Bang: How the Big Data Explosion Is Changing the World.
Moore, G.E., 2006. Cramming more components onto integrated circuits. Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp.114 ff. IEEE Solid-State Circuits Newsletter, 20(3).
Narayanan, A. & Shmatikov, V., 2008. Robust de-anonymization of large sparse datasets. In Proceedings - IEEE Symposium on Security and Privacy, pp. 111–125.
NIST Big Data Public Working Group, 2014. Big Data Interoperability Framework: Definitions (draft).
Pearson, T. & Wegener, R., 2013. Big Data: The organizational challenge.
Ronda-Pupo, G.A. & Guerras-Martin, L.Á., 2012. Dynamics of the evolution of the strategy concept 1962–2008: a co-word analysis. Strategic Management Journal, 33(2), pp.162–188.
Russom, P., 2011. Big data analytics. TDWI Best Practices Report, Fourth Quarter.
Schroeck, M. et al., 2012. Analytics: The real-world use of big data.
Shneiderman, B., 2008. Extreme visualization: squeezing a billion records into a million pixels. In International Conference on Management of Data, pp.3–12.
Shvachko, K. et al., 2010. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST 2010).
Suthaharan, S., 2013. Big Data Classification: Problems and challenges in network intrusion prediction with machine learning. Big Data Analytics Workshop.
Viegas, F.B. et al., 2007. Many Eyes: A site for visualization at internet scale. IEEE Transactions on Visualization and Computer Graphics, 13(6), pp.1121–1128.
Ward, J. & Barker, A., 2013. Undefined By Data: A Survey of Big Data Definitions. arXiv preprint arXiv:1309.5821.
Xiong, W. et al., 2013. A characterization of big data benchmarks. In 2013 IEEE International Conference on Big Data, pp. 118–125.
Zaslavsky, A., Perera, C. & Georgakopoulos, D., 2013. Sensing as a service and big data. arXiv preprint.