Preprint copy
Presented at “4th International Conference on Integrated Information”.
Accepted for publication in “AIP Proceedings”, exp. end 2014.
What is Big Data? A Consensual Definition and a Review of
Key Research Topics
Andrea De Mauro1, a), Marco Greco2, b) and Michele Grimaldi2, c)
1Department of Enterprise Engineering, University of Rome Tor Vergata, Via del Politecnico 1, 00133 Roma, Italy
2Department of Civil and Mechanical Engineering, University of Cassino and Southern Lazio, Via Di Biasio 43,
03043 Cassino (FR), Italy
a)Corresponding author: andrea.de.mauro@uniroma2.it
b)m.greco@unicas.it
c)m.grimaldi@unicas.it
Abstract. Although Big Data is a trending buzzword in both academia and industry, its meaning is still shrouded in
conceptual vagueness. The term is used to describe a wide range of concepts: from the technological ability to
store, aggregate, and process data, to the cultural shift that is pervasively invading business and society, both of
which are drowning in information overload. The lack of a formal definition has led research to evolve along multiple,
inconsistent paths. Furthermore, the existing ambiguity among researchers and practitioners undermines the efficient
development of the subject. In this paper we review the existing literature on Big Data and analyze its previous
definitions in order to pursue two results: first, to provide a summary of the key research areas related to the
phenomenon, identifying emerging trends and suggesting opportunities for future development; second, to provide a
consensual definition of Big Data, by synthesizing the common themes of existing works and the patterns found in
previous definitions.
Keywords: Big Data; Analytics; Information Management; Data Processing; Business Intelligence.
INTRODUCTION
Big Data¹ has now become a ubiquitous term in many parts of industry and academia. As often happens in these
cases, the frequent use of the same words in different contexts poses a threat to the structured evolution
of their meaning. For this reason it is necessary to invest time and effort in proposing and establishing a
standard definition of Big Data that would pave the way to its systematic evolution and minimize the confusion
related to its usage. In order to describe Big Data we have decided to start from an “as is” analysis of the contexts in
which the term most frequently appears. Given its remarkable success and its hectic evolution, Big Data possesses
multiple and diverse nuances of meaning, all of which have the right to exist. By analyzing the most significant
occurrences of this term in both academic and business literature we have identified four key themes to which Big
Data refers: Information, Technologies, Methods and Impact. We can reasonably assert that the vast majority of
references to Big Data encompass one of the four themes listed above. Understanding how these themes have been
dealt with in the existing literature and how they are mutually interconnected is the objective of the first section of this
paper, and is preparatory to proposing a thorough definition, which the second section aims
to provide. We believe that having such a definition will enable a more conscious usage of the term Big Data and a
more coherent development of research on this subject.
¹ We have chosen to capitalize the term ‘Big Data’ throughout this article to clarify that it is the specific subject we are discussing.
REVIEW OF MAIN RESEARCH TOPICS
This section presents a broad, though non-exhaustive, review of research topics in the area of Big Data.
We examined a large number of abstracts of peer-reviewed conference and journal papers and identified
recurring topics by looking at the appearance frequency of the top keywords and making an educated guess about their
interrelation. This heuristic approach was needed to depict the ample range of concepts related to Big Data
while using a relatively small number of topic categories. A systematic literature
review is beyond the scope of this paper and is left as an opportunity for future work. The input list of documents was
obtained from Elsevier’s Scopus, a citation database containing more than 50 million records from around 5,000
publishers. On the 3rd of May 2014 we exported a list of 1,581 conference papers and articles that contained the full
term “Big Data” in either the title or the author-provided keywords². We removed those entries where
the abstract text was not available, which left us with a corpus of 1,437 documents. By counting the appearance
frequency of words included in the abstracts we have identified the most recurring items. Figure 1 shows a static tag
cloud visualization (also known as “word cloud”) of the most popular words in the abstracts we analyzed, obtained
through the online tool ManyEyes (Viegas et al. 2007).
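The frequency count described above can be sketched in a few lines; the mini-corpus and stopword list below are invented stand-ins for the 1,437 Scopus abstracts.

```python
from collections import Counter
import re

# Invented mini-corpus standing in for the 1,437 Scopus abstracts.
abstracts = [
    "Big data analytics requires distributed processing frameworks such as Hadoop.",
    "We propose a MapReduce approach for mining large data sets.",
    "Privacy concerns arise when analytics is applied to big data.",
]

# Tiny stopword list for the sketch; a real analysis would use a fuller one.
stopwords = {"a", "an", "the", "is", "as", "for", "we", "to", "such", "when"}

# Tokenize, lowercase and count every remaining word across all abstracts.
counts = Counter(
    token
    for text in abstracts
    for token in re.findall(r"[a-z]+", text.lower())
    if token not in stopwords
)

print(counts.most_common(3))  # the most frequent keywords first
```

The resulting frequencies are exactly what a tag cloud such as Figure 1 visualizes, with font size proportional to count.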
By analyzing the most frequent keywords included in Big Data-related abstracts and considering their mutual
relationships we have identified four top research themes in current literature, namely: 1. Information, 2.
Technology, 3. Methods, 4. Impact. We believe that the great majority of papers written on Big Data touch upon one
or more of these four topics. For each of them we will now describe content, trends and enlist a number of relevant
works.
FIGURE 1. Static tag cloud visualization (word cloud) of key terms appearing in abstracts of Big Data-related papers.
The Fuel of Big Data: Information
One of the fundamental reasons for the Big Data phenomenon to exist is the current extent to which information can
be generated and made available. Digitization, i.e. the process of converting continuous, analog information into a
discrete, digital and machine-readable format, reached broad popularity with the first “mass digitization” projects.
Mass digitization is the attempt to convert entire printed book libraries into digital collections by leveraging optical
character recognition (OCR) software in order to minimize human intervention (Coyle 2006). One of the most
popular attempts at mass digitization was the Google Print Library Project³, started in 2004, which aimed at digitizing
more than 15 million volumes held in multiple university libraries, including Harvard, Stanford and Oxford. More
² We have used the following search query in Scopus: “AUTHKEY("Big data") OR TITLE("big data") AND (LIMIT-TO(DOCTYPE, "cp") OR LIMIT-TO(DOCTYPE, "ar") OR LIMIT-TO(DOCTYPE, "ip"))”
³ For more information you can visit the Google Books History page, available at http://www.google.com/googlebooks/about/history.html.
recently, a subtle differentiation has been proposed between digitization and its next step, datafication, i.e. putting a
phenomenon into a quantified format so that it can be tabulated and analyzed (Mayer-Schönberger & Cukier 2013).
The fundamental difference is that digitization enables analog information to be transferred and stored in a more
convenient digital format, while datafication aims at organizing digitized versions of analog signals in order to
generate insights that could not have been inferred while the signals were in their original form. In the case of the
previously cited Google mass digitization effort, the value of datafication emerged when researchers showed they were
able to provide insights on lexicography, the evolution of grammar, collective memory, the adoption of technology,
the pursuit of fame, censorship, and historical epidemiology by using Google Books’ data (Michel et al. 2011).
Digitization and datafication have become pervasive phenomena thanks to the broad availability of devices that
are both connected and provided with digital sensors. Digital sensors enable digitization while connection lets data
be aggregated and, thus, permits datafication. Cisco estimated that between 2008 and 2009 the number of connected
devices overtook the number of living people (Evans 2011) and, according to Gartner (2014) by 2020 there will be
26 billion devices on earth, more than 3 devices on average per person. The pervasive presence of a variety of
objects (including mobile phones, sensors, Radio-Frequency Identification - RFID - tags, actuators), which are able
to interact with each other and cooperate with their neighbors to reach common goals, goes under the name of the
Internet of Things, IoT (Estrin et al. 2002; Atzori et al. 2010). This increasing availability of sensor-enabled,
connected devices is equipping companies with extensive information assets from which it is possible to create new
business models, improve business processes and reduce costs and risks (Chui et al. 2010). In other words, IoT is
one of the most promising fuels of Big Data expansion.
Another characteristic of the data generated today is its increasing variety in type. Structured data (traditional
text/numeric information) is now joined by unstructured data (audio, video, images, text and human language) and
semistructured data, such as XML and RSS feeds (Russom 2011). The diversity of data types is one of the
challenges that organizations need to tackle in order to make value out of the extensive informational assets
available today (Manyika et al. 2011).
Equipment for Working with Big Data: Technology
The term Big Data is frequently associated with the specific technology that enables its utilization. The extent of
the dataset size and the complexity of the operations needed for its processing entail stringent storage and
computational performance requirements. According to Google Trends, the query most related to “Big Data” is
“Hadoop”, which is indeed the most prominent technology associated with this topic. Hadoop is an open-source
framework that enables the distributed processing of large quantities of data by using a group of dispersed machines
and specific programming models. The main components of Hadoop are: 1. its file system, HDFS, which
allows access to data scattered over multiple machines without having to cope with the complexity inherent in their
dispersed nature; 2. MapReduce, a programming model designed to implement distributed and parallel algorithms in
an efficient way. Both HDFS (Shvachko et al. 2010) and MapReduce (Dean & Ghemawat 2008) are evolutions of
concepts that were originally proposed by Google (Ghemawat et al. 2003) and were then developed as open-source
projects within the Apache framework. This attests to the centrality of Google in initiating the current
thinking about Big Data. The Hadoop framework contains multiple modules and libraries, compatible with HDFS
and MapReduce, that extend its applicability to the various needs of coordination, analysis,
performance management and workflow design that normally occur in Big Data applications.
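The MapReduce model just described can be illustrated with a toy, single-process word count; a real Hadoop job would distribute the map, shuffle and reduce phases across machines.

```python
from collections import defaultdict
from itertools import chain

# Toy single-process simulation of the MapReduce flow: map emits (key, value)
# pairs, shuffle groups them by key, reduce aggregates each group.

def map_phase(document):
    # Canonical word count: emit (word, 1) for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the values for one key; here, a simple sum of the counts.
    return key, sum(values)

documents = ["big data big insights", "data at scale"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'insights': 1, 'at': 1, 'scale': 1}
```

Because map and reduce operate on independent keys, the framework can run them in parallel on different machines, which is the source of Hadoop's scalability.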
The distributed nature of information requires a specific technological effort for transmitting large quantities of
data and for monitoring the overall system performance using special benchmarking techniques (Xiong et al. 2013).
Another fundamental technological element is the ability to store a greater quantity of data on smaller physical
devices. Although Moore’s law suggests that storage capacity increases exponentially over time (2006), a
continuous and expensive research and development effort is still required to keep up with the pace at which
data size increases (Hilbert & López 2011), especially with the growing share of byte-hungry data types such as
images, sounds and videos.
Transforming Big Data into Value: Methods
The analysis of extensive quantities of data and the need to extract value from individual behaviors require
processing methods that go beyond traditional statistical techniques. Knowledge of such methods, of their
potential and, above all, of their limitations requires specific skills that are hard to find in today’s job marketplace.
Both Manyika et al. (2011) and Chen et al. (2012) propose a list of Big Data Analytical Methods, which includes (in
alphabetical order): A/B testing, Association rule learning, Classification, Cluster analysis, Data fusion and data
integration, Ensemble learning, Genetic algorithms, Machine learning, Natural Language Processing, Neural
networks, Network analysis, Pattern recognition, Predictive modelling, Regression, Sentiment Analysis, Signal
Processing, Spatial analysis, Statistics, Supervised and Unsupervised learning, Simulation, Time series analysis and
Visualization.
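As an illustration of one of the listed methods, the following is a minimal, stdlib-only sketch of cluster analysis via k-means on invented 2-D points; production work would rely on a dedicated library.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    # Start from k points chosen at random as the initial centroids.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated, invented groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

The alternation of assignment and update steps is the same pattern, at a small scale, that distributed implementations of the other listed methods follow on Big Data platforms.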
Chen et al. (2012) evoke the need for companies to invest in Business Intelligence and Analytics education that
would be “interdisciplinary and cover critical analytical and IT skills, business and domain knowledge, and
communication skills required in a complex data-centric business environment”. The investment in analytical
knowledge should be accompanied by a cultural change that would span across all employees and urge them to
“efficiently manage data properly and incorporate them into decision making processes” (Buhl et al. 2013). Mayer-
Schönberger and Cukier (2013) envision the rise of new specific professional entities, called algorithmists, that
would master the areas of computer science, mathematics and statistics and act as “impartial auditors to review the
accuracy or validity of Big Data predictions”. Similarly, Davenport and Patil (2012) describe the data scientist as a hybrid of
“data hacker, analyst, communicator, and trusted adviser”, who also has the fundamental abilities to write code and
conduct, when needed, academic-style research. These skills are not sufficiently available to meet the increasing
demand: according to Manyika et al. (2011), by the year 2018 there will be a potential shortfall of 1.5 million data-
savvy managers and analysts in the US alone. The analysis of competency gaps and the creation of effective teaching
methods to fill them, for both future and current managers and practitioners, is a promising research area that still
has much room to grow.
The ability to make informed decisions is also changing with the expansion of Big Data, as the latter implies
a shift from logical, causality-based reasoning to the acknowledgment of correlation links between events. The
utilization of insights generated through Big Data Analytics in companies, universities and institutions calls for
an adaptation to a new culture of decision making (McAfee & Brynjolfsson 2012) and an evolution of the scientific
method (Anderson 2007), both of which are still to be built and provide opportunities for future research.
Being aware of the limitations of Big Data Methods and of potential methodological issues is a fundamental
resource for organizations that want to drive data-based decision making: for example, predictions should always be
accompanied by valid confidence intervals in order to avoid the false sense of precision that the apparent
sophistication of some Big Data applications can suggest. Analysts should also be capable of avoiding model
overfitting, which facilitates apophenia, i.e. the human tendency to “see patterns where none actually exist
simply because enormous quantities of data can offer connections that radiate in all directions” (Boyd & Crawford
2012).
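The point about confidence intervals can be made concrete with a bootstrap sketch: instead of reporting a bare mean, resample the data and report an interval. The sample values below are invented.

```python
import random
import statistics

def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, seed=42):
    # Resample with replacement many times and collect the resampled means.
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    # Take the empirical alpha/2 and 1 - alpha/2 percentiles as the interval.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Invented measurements of some business metric.
data = [12.1, 11.8, 12.4, 12.0, 12.6, 11.9, 12.3, 12.2]
low, high = bootstrap_ci(data)
print(f"mean = {statistics.mean(data):.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Reporting the interval alongside the point estimate makes the uncertainty explicit instead of hiding it behind an apparently precise number.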
In summary, Big Data requires the mastery of specific techniques, awareness of their strengths and limitations,
and a widespread cultural tendency toward informed decision making that in most cases has yet to be built.
How Big Data Changes our Lives: Impact
The extent to which Big Data is impacting our society and our companies is often depicted through anecdotes
and success stories of methods and technology implementations. When these stories are accompanied by proposals
of new principles and methodological improvements they represent a valuable contribution to the creation of
knowledge on the subject. The pervasive nature of current information production and availability leads to many
applications spanning numerous scientific fields and industry sectors that can be very distant from one another.
Sometimes the same techniques and data have been applied to solve problems in distant domains. For example,
correlation analysis was leveraged to use logs of Google searches to forecast influenza epidemics (Ginsberg et al.
2009) as well as unemployment (Askitas & Zimmermann 2009) and inflation (Guzman 2011). The existing Big Data
applications are many and expected to grow: hence, their systematic description constitutes a promising
development area for those willing to contribute to the scientific progress in this field.
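The correlation approach cited above reduces, in its simplest form, to computing a Pearson coefficient between a search-volume series and an official indicator series; both series below are made up for illustration.

```python
# Pearson correlation between an invented weekly search-volume series and an
# invented series of officially reported cases, computed from first principles.
search_volume = [10, 14, 20, 35, 50, 44, 30, 18]
reported_cases = [12, 15, 22, 38, 55, 47, 33, 20]

def pearson(xs, ys):
    # r = cov(x, y) / (sd(x) * sd(y)), computed without any libraries.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(search_volume, reported_cases)
print(round(r, 4))  # close to 1: the two series move together
```

A high coefficient only establishes that the series move together, which is exactly the correlation-over-causality caveat discussed later in this paper.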
Big Data can also impact society adversely. In fact, there are multiple concerns arising from the quick
advancement of Big Data (Boyd & Crawford 2012), the first being privacy. Although large data sets normally
proceed from actions performed by a multitude of individuals, it is not always true that the consequences of using that data
will not impact a single individual in an invasive and/or unexpected way. The identifiability of the individual person
can be avoided through a thorough anonymization of the data set, although this is hard to guarantee fully, as the
reverse process of de-anonymization can potentially be attempted (Narayanan & Shmatikov 2008). The
predictability of future actions, made possible by the analysis of behavioral patterns, also poses the ethical issue of
protecting free will in the future, on top of freedom in the present.
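One safeguard studied in the anonymization literature is k-anonymity: a data set is k-anonymous if every combination of quasi-identifiers is shared by at least k records, so no individual is unique on those attributes. The records below are fictitious, and the check is a simplified sketch.

```python
from collections import Counter

# Fictitious, already-generalized records: zip code truncated, age bucketed.
records = [
    {"zip": "001*", "age": "30-39", "diagnosis": "flu"},
    {"zip": "001*", "age": "30-39", "diagnosis": "cold"},
    {"zip": "002*", "age": "40-49", "diagnosis": "flu"},
    {"zip": "002*", "age": "40-49", "diagnosis": "asthma"},
]

def is_k_anonymous(rows, quasi_identifiers, k):
    # Count how often each combination of quasi-identifier values occurs;
    # the data set is k-anonymous if no combination occurs fewer than k times.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values()) >= k

print(is_k_anonymous(records, ["zip", "age"], 2))  # True: every combo occurs twice
```

Even so, as Narayanan and Shmatikov showed, such guarantees can be undermined by linking the anonymized set with auxiliary data, which is why anonymization is hard to guarantee fully.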
Other issues to be considered are related to the accessibility of information: exclusive control over data
sources can become an abuse of a dominant position and restrict competition by posing unfair entry barriers to the
marketplace. For example, as Manovich (2011) notes, “only social media companies have access to really large
social data – especially transactional data”, and they have full control over who can access what information. The
split between information-rich and data-lacking companies can create a new digital divide (Boyd & Crawford 2012)
that can slow down innovation in the sector. Specific policies will have to be promoted, and data is likely to become
a new dimension to consider within antitrust regulations.
Not only society but also companies are heavily impacted by the rise of Big Data: the call to arms for acquiring the
vital skills and technology needed to be competitive in a data-driven market implies a serious reconsideration of firm
organization and the full realm of business processes (Pearson & Wegener 2013). The transformation of data into
competitive advantage (McAfee & Brynjolfsson 2012) is what makes “Big Data” such an impactful revolution in
today’s business world.
FIGURE 2. Big Data key topics in existing research.
A DEFINITION FOR BIG DATA
A convincing definition of a concept is an enabler of its scientific development. As Ronda-Pupo and Guerras-
Martin (2012) suggest, the level of consensus shown by a scientific community on the definition of a concept can be
used as a measure of the progress of a discipline. Big Data has instead evolved so quickly and in such a disorderly
fashion that a universally accepted formal statement denoting its meaning does not exist. There have been many
attempts to define Big Data, more or less popular in terms of utilization and citation. However, none of these proposals
has prevented the authors of Big Data-related works from extending, renovating or even ignoring previous definitions
and proposing new ones. Although Big Data is still a relatively young concept, it certainly deserves an accepted
vocabulary of reference that enables the proper development of the discipline among cognoscenti and practitioners.
[Figure 2 maps subtopics to the four themes: Internet of Things, Datafication, Information Overload and Diverse/Unstructured data (Information); Distributed Systems, Parallel Computing, Storage Capabilities and Programming Paradigms (Technology); Machine Learning and Visualization (Methods); Value Creation, Privacy, Emerging Skills, Applications, Decision Making, Organizations and Society (Impact).]
In the first part of this paper we have identified the four main themes of Big Data and observed that they
are the prevalent topics in the existing literature. In the next paragraphs we will review a non-exhaustive list of
previously proposed Big Data definitions and conceptually tie them to the aforementioned four themes of
research. After considering the existing definitions and analyzing their commonalities, we will propose a consensual
definition of Big Data. Consensus in this case comes from acknowledging the centrality of some recurring
attributes associated with Big Data, and from the assumption that they define the essence of what Big Data means to
scholars and practitioners today. We expect that such a definition will be less prone to objections from the authors
and users of previous definitions, as it is based on the most central aspects associated with Big Data to date.
A thorough consensus analysis based on Cohen’s K coefficient (1960) and co-word analysis, as in (Ronda-Pupo
& Guerras-Martin 2012), goes beyond the scope of this work and is left for future study.
Survey of Existing Definitions
Big Data has often been described “implicitly” through success stories or anecdotes, characteristics,
technological features, emerging trends or its impact on society, organizations and business processes. Among the
existing attempts at explicit definitions of Big Data there is not even agreement on what kind of entity the term
refers to. We have found that Big Data is used when referring to a variety of different entities including - but
not limited to - a social phenomenon, information assets, data sets, analytical techniques, storage technologies,
processes and infrastructures. We have surveyed the multiple definitions that have been proposed to date and listed them
in Tab. 1; in this paragraph we will go through the most notable ones.
A first group of Big Data definitions focuses on enlisting its characteristics, and what is probably the most popular
definition falls within this group. When presenting the data management challenges that companies had to face in
response to the rise of e-commerce in the early 2000s, Laney (2001) introduces a framework expressing the three-
dimensional increase in data Volume, Velocity and Variety and invokes the need for new formal practices that imply
“tradeoffs and architectural solutions that involve/impact application portfolios and business strategy decisions”.
Although this work did not mention Big Data explicitly, the model, later nicknamed “the 3 V’s”, was
associated with the concept of Big Data and used as its definition (Beyer & Laney 2012; Eaton et al. 2012; Zaslavsky
et al. 2013). Many other authors extended the “3 V’s” model and, as a result, multiple features of Big Data, like
Value (Dijcks 2012), Veracity (Schroeck et al. 2012), Complexity and Unstructuredness (Intel 2012; Suthaharan
2013), were added to the list.
A second group of definitions emphasizes the technological needs behind the processing of large amounts of
data. According to Microsoft (2013), Big Data is about applying “serious computing power” to massive sets of
information; the National Institute of Standards and Technology (NIST) likewise highlights the need for a “scalable
architecture for efficient storage, manipulation, and analysis” when defining Big Data (2014).
A few definitions associate Big Data with the crossing of some sort of threshold: for instance, Dumbill (2013)
asserts that data is Big when it “exceeds the processing capacity of conventional database systems” and requires the
choice of “an alternative way to process it”. Fisher et al. (2012) acknowledge that the size that constitutes “big” has
grown according to Moore’s law and link the absolute level of this threshold to the capacity of commercial storage
solutions: Big Data “is so large as to not fit on a single hard drive” and, hence, “will be stored on several different
disks”.
A last group of definitions highlights the impact of Big Data advancement on society. Boyd and Crawford (2012)
notice that “Big Data is less about data that is big than it is about a capacity to search, aggregate, and cross-
reference large data sets”. They define Big Data as “a cultural, technological, and scholarly phenomenon” that rests
on the interplay of Technology (maximizing computation power and algorithmic accuracy), Analysis (identifying
patterns in large data sets) and Mythology (the belief that large data sets offer a higher form of intelligence
with an aura of truth, objectivity and accuracy). Mayer-Schönberger and Cukier (2013) describe Big Data by
enlisting the three key “shifts in the way we analyze information that transform how we understand and organize
society”: 1. “More data”, in terms of the “completeness” of the data set, using all available data instead of a sample
of it; 2. “More messy”, meaning that we can loosen our desire for exactitude and also use incomplete or less
accurate input data; 3. “Correlation” becomes more important and overtakes “causality” as a way to make sense of
trends and, finally, to make decisions.
TABLE 1. Existing definitions of Big Data, adapted from the articles referenced in the first column. The themes listed in
brackets indicate which of the four Big Data themes identified in the first section of the paper each definition alludes to,
through the following legend: I - Information, T - Technology, M - Methods, P - Impact.

(Beyer & Laney 2012) [I, T, P] High volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
(Dijcks 2012) [I, P] The four characteristics defining big data are Volume, Velocity, Variety and Value.
(Intel 2012) [I] Complex, unstructured, or large amounts of data.
(Suthaharan 2013) [I] Can be defined using three data characteristics: Cardinality, Continuity and Complexity.
(Schroeck et al. 2012) [I, P] Big data is a combination of Volume, Variety, Velocity and Veracity that creates an opportunity for organizations to gain competitive advantage in today’s digitized marketplace.
(NIST Big Data Public Working Group 2014) [I, T] Extensive datasets, primarily in the characteristics of volume, velocity and/or variety, that require a scalable architecture for efficient storage, manipulation, and analysis.
(Ward & Barker 2013) [I, T, M] The storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.
(Microsoft 2013) [I, T, M] The process of applying serious computing power, the latest in machine learning and artificial intelligence, to seriously massive and often highly complex sets of information.
(Dumbill 2013) [I, T] Data that exceeds the processing capacity of conventional database systems.
(Fisher et al. 2012) [I, T] Data that cannot be handled and processed in a straightforward manner.
(Shneiderman 2008) [I] A dataset that is too big to fit on a screen.
(Manyika et al. 2011) [I, T, M] Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
(Chen et al. 2012) [I, T, M] The data sets and analytical techniques in applications that are so large and complex that they require advanced and unique data storage, management, analysis, and visualization technologies.
(Boyd & Crawford 2012) [T, M, P] A cultural, technological, and scholarly phenomenon that rests on the interplay of Technology, Analysis and Mythology.
(Mayer-Schönberger & Cukier 2013) [I, M, P] Phenomenon that brings three key shifts in the way we analyze information that transform how we understand and organize society: 1. More data, 2. Messier (incomplete) data, 3. Correlation overtakes causality.
Consensual Definition
By looking at both the existing definitions of Big Data and at the main research topics associated with it, we can
affirm that the nucleus of the concept of Big Data can be expressed by:
- ‘Volume’, ‘Velocity’ and ‘Variety’, to describe the characteristics of the Information involved;
- specific ‘Technology’ and ‘Analytical Methods’, to clarify the unique requirements strictly needed to make use of such Information;
- transformation into insights and the consequent creation of economic ‘Value’, as the principal way Big Data is impacting companies and society.
We believe that the “object” to which Big Data should refer in its definition is ‘Information assets’, as this
entity is clearly identifiable and is not dependent on the field of application.
Therefore, we propose the following formal definition:
“Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to
require specific Technology and Analytical Methods for its transformation into Value.”
Such a definition of Big Data is compatible with the existence of terms like “Big Data Technology” and “Big
Data Methods” that should be used when referring directly to the specific technology and methods mentioned in the
main definition.
CONCLUSION
Big Data has recently become a voguish term among researchers and IT professionals. Its success is propelled by
its frequent utilization in a broad range of contexts and with several, often incongruous, meanings. As a result,
its meaning is still nebulous, and this hinders an organized evolution of the subject.
We have conducted an analysis of the usage of this term in the literature and concluded that the top four themes
associated with Big Data are: Information, Technology, Methods and Impact. We have then suggested a
definition that is coherent with the current “as is” utilization of the term and consensual with the most prominent
definitions proposed so far. We suggest using Big Data as a standalone term when referring to those
“Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value” and as an attribute when denoting its peculiar requisites, e.g.
“Big Data Technology” or “Big Data Analytical Methods”. We believe that using this definition from now on will
allow a more efficient scientific development of the matter.
Possible extensions to the present work include:
- A systematic literature review of “Big Data” by means of quantitative methods, such as co-word, cluster and frequency analysis. The review should also identify a more granular list of research topics through systematic methods like topic modeling.
- A study of how Big Data is systematically impacting the creation of economic value in companies, and a proposal of guidelines for a coherent development of systems and processes related to Business Intelligence and Analytics. We can presume that the value creation chain goes through the four themes of Big Data and that maximizing the value each component brings would generate higher returns on BI&A investments.
REFERENCES
Anderson, C., 2007. The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired, p.3.
Askitas, N. & Zimmermann, K.F., 2009. Google Econometrics and Unemployment Forecasting. Applied Economics
Quarterly, 55(2), pp.107–120.
Atzori, L., Iera, A. & Morabito, G., 2010. The Internet of Things: A survey. Computer Networks, 54(15), pp.2787–
2805.
Beyer, M.A. & Laney, D., 2012. The Importance of “Big Data”: A Definition. Gartner Publications, pp.1–9.
Boyd, D. & Crawford, K., 2012. Critical Questions for Big Data. Information, Communication & Society, 15(5),
pp.662–679.
Buhl, H.U. et al., 2013. Big Data. Business & Information Systems Engineering, 5(2), pp.65–69.
Chen, H., Chiang, R. & Storey, V., 2012. Business Intelligence and Analytics: From Big Data to Big Impact. MIS
Quarterly, 36(4), pp.1165–1188.
Chui, M., Löffler, M. & Roberts, R., 2010. The Internet of things. McKinsey Quarterly, 291(2), p.10.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20,
pp.37–46.
Coyle, K., 2006. Mass Digitization of Books. Journal of Academic Librarianship, 32(6), pp.641–645.
Davenport, T.H. & Patil, D.J., 2012. Data Scientist: The Sexiest Job Of the 21st Century. Harvard Business Review,
90(10), pp.70–76.
Dean, J. & Ghemawat, S., 2008. MapReduce: Simplified Data Processing on Large Clusters. Communications of the
ACM, 51(1), pp.1–13.
Dijcks, J., 2012. Oracle: Big data for the enterprise. Oracle White Paper, (June).
Dumbill, E., 2013. Making Sense of Big Data. Big Data.
Eaton, C. et al., 2012. Understanding Big Data, McGraw-Hill Companies.
Estrin, D. et al., 2002. Connecting the physical world with pervasive networks. IEEE Pervasive Computing, 1(1),
pp.59–69.
Evans, D., 2011. The Internet of Things - How the Next Evolution of the Internet is Changing Everything. CISCO
white paper, (April), pp.1–11.
Fisher, D. et al., 2012. Interactions with Big Data Analytics. interactions.
Gartner, 2014. Gartner Says the Internet of Things Will Transform the Data Center.
Ghemawat, S., Gobioff, H. & Leung, S.-T., 2003. The Google file system. ACM SIGOPS Operating Systems
Review, 37(5), p.29.
Ginsberg, J. et al., 2009. Detecting influenza epidemics using search engine query data. Nature, 457(7232),
pp.1012–1014.
Guzman, G., 2011. Internet search behavior as an economic forecasting tool: The case of inflation expectations.
Journal of economic and social measurement, 36(3), pp.119–167.
Hilbert, M. & López, P., 2011. The world’s technological capacity to store, communicate, and compute information.
Science (New York, N.Y.), 332(6025), pp.60–65.
Intel, 2012. Big Data Analytics. Intel’s IT Manager Survey on How Organizations Are Using Big Data.
Laney, D., 2001. 3D data management: Controlling data volume, velocity and variety. META Group Research Note,
6 (February 2001).
Manovich, L., 2011. Trending: The Promises and the Challenges of Big Social Data. Debates in the Digital
Humanities, pp.1–10.
Manyika, J. et al., 2011. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global
Institute.
Mayer-Schönberger, V. & Cukier, K., 2013. Big Data: A Revolution That Will Transform How We Live, Work and
Think. London: John Murray.
McAfee, A. & Brynjolfsson, E., 2012. Big data: the management revolution. Harvard business review, (October
2012).
Michel, J.-B. et al., 2011. Quantitative analysis of culture using millions of digitized books. Science (New York,
N.Y.), 331(6014), pp.176–82.
Microsoft, 2013. The Big Bang: How the Big Data Explosion Is Changing the World.
Moore, G.E., 2006. Cramming more components onto integrated circuits (reprinted from Electronics, 38(8), April
19, 1965, pp.114 ff.). IEEE Solid-State Circuits Newsletter, 20(3).
Narayanan, A. & Shmatikov, V., 2008. Robust de-anonymization of large sparse datasets. In Proceedings - IEEE
Symposium on Security and Privacy. pp. 111–125.
NIST Big Data Public Working Group, 2014. Big Data Interoperability Framework: Definitions (draft).
Pearson, T. & Wegener, R., 2013. Big Data: The organizational challenge.
Ronda-Pupo, G.A. & Guerras-Martin, L.Á., 2012. Dynamics of the evolution of the strategy concept 1962-2008: a
co-word analysis. Strategic Management Journal, 33(2), pp.162–188.
Russom, P., 2011. Big data analytics. TDWI Best Practices Report, Fourth Quarter.
Schroeck, M. et al., 2012. Analytics: The real-world use of big data.
Shneiderman, B., 2008. Extreme visualization: squeezing a billion records into a million pixels. International
conference on Management of data, pp.3–12.
Shvachko, K. et al., 2010. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage
Systems and Technologies, MSST2010.
Suthaharan, S., 2013. Big Data Classification: Problems and challenges in network intrusion prediction with
machine learning. Big Data Analytics workshop.
Viegas, F.B. et al., 2007. Many Eyes: A site for visualization at internet scale. IEEE Transactions on Visualization
and Computer Graphics, 13(6), pp.1121–1128.
Ward, J. & Barker, A., 2013. Undefined By Data: A Survey of Big Data Definitions. arXiv preprint
arXiv:1309.5821.
Xiong, W. et al., 2013. A characterization of big data benchmarks. In Big Data, 2013 IEEE International
Conference on. pp. 118–125.
Zaslavsky, A., Perera, C. & Georgakopoulos, D., 2013. Sensing as a service and big data. arXiv preprint.