
Review of social media analytics process and Big Data pipeline

Authors:
  • Data Engineering and Semantics Research Unit. Faculty of Sciences of Sfax. University of Sfax. Tunisia

Social Network Analysis and Mining (2018) 8:30
https://doi.org/10.1007/s13278-018-0507-0
REVIEW ARTICLE
Review ofsocial media analytics process andBig Data pipeline
HibaSebei1 · MohamedAliHadjTaieb1 · MohamedBenAouicha1
Received: 29 August 2017 / Revised: 25 March 2018 / Accepted: 27 March 2018
© Springer-Verlag GmbH Austria, part of Springer Nature 2018
Abstract
Social media analytics is a research area focused on extracting useful insights from social media data, with the aim
of helping individuals and organizations make better decisions in many domains (business, marketing, politics, health,
etc.). Social networks, microblogging, and media-sharing websites are prominent instances of online social media. Built
on Web 2.0 technologies, they promote interaction between users and the websites and shift the user's position from
that of a mere consumer to that of a producer of social data. Huge amounts of social data are thus generated, making
these websites critical sources of Big Data. Traditional analytical techniques appear inadequate for processing this
mass of unstructured social media data and for coping with its scale, in particular the shift from batch to streaming.
This has led to the introduction of Big Data technologies throughout the analysis process. The present survey is
intended to help researchers identify the challenges encountered during the analysis process along with the
corresponding Big Data solutions. The aim is to provide a clear analytical process applicable with Big Data
technologies. A systematic literature review is conducted to address the challenges facing the integration of Big Data
technologies and to present some adequate solutions. Following an extensive literature search, a global view of the
combination of social media analytics and Big Data technologies is drawn and discussed, along with a promising
research trend.
Keywords Big Data pipeline · Online social media · Social Big Data · Social media analytics · Big Data challenges ·
Big Data technologies
1 Introduction
Online social media websites are a critically important source of Big Data (Gandomi and Haider 2015; Yaqoob et al.
2016). They mainly include online social networking websites (e.g., Facebook, MySpace), multimedia sharing websites
(e.g., YouTube, Instagram), and microblogging websites (e.g., Twitter). Given the widespread use of these websites
worldwide, a huge amount of data is generated every second.
According to the 2018 report on the digital world published by We Are Social and Hootsuite1, more than half of the
world's population now uses the Internet, 42% are active social media users, and about 39% are active mobile social
users.
In addition to their large-scale use, these websites are built on Web 2.0 technology (Cormode and Krishnamurthy 2008;
Newman et al. 2016; Zeng et al. 2010), which makes the exploitation of the massive amounts of user-generated data
profitable for a wide range of domains, such as politics (Stieglitz and Dang-Xuan 2013), through the forecasting of
election results, the biomedical area (Kotsilieris et al. 2017; Ji et al. 2017), and business decision-making (Rahmani
et al. 2014), through the assessment of customers' attitudes as expressed on the various social media sites (He
et al. 2017). Thus, the rise of social media analytics as a
* Hiba Sebei
hiba.enis@gmail.com
1 Multimedia, InfoRmation systems and Advanced Computing Laboratory, Sfax University, Sfax, Tunisia
1 https://wearesocial.com/blog/2018/01/global-digital-report-2018
process (Gandomi and Haider 2015) aims to extract useful knowledge from social data through the collection, cleaning,
and analysis of social data. What matters most about data is not the data themselves but rather the information and
knowledge they contain. Several studies (Chen et al. 2009; Rowley 2007; Stenmark 2002) discuss the difference between
data, information, and knowledge, and devise a hierarchy to highlight the connection between them. In this respect,
Ackoff (1989) defines data as “symbols that represent the properties of objects and events data have no meaning,
processing this data leads to the producing of information that have meaning and answer to question related to ‘who,’
‘what,’ ‘where,’ and ‘when’ and knowledge is about transforming information into instructions and rules and providing
answers to ‘how’ questions.” Chen et al. (2009), for their part, distinguish the three terms in a computational space,
with regard to their significance in computer memory. Accordingly, data refer to the representations of models and
attributes of real or simulated entities, while information represents the results of a computational process, such as
statistical analysis, for assigning meanings to the data, or the transcripts of some meanings assigned by human beings,
and knowledge designates “Data that represent the results of a computer-simulated cognitive process, such as
perception, learning, association, and reasoning, or the transcripts of some knowledge acquired by human beings.”
In its complete form, social media analytics refers to the traditional multi-step task of analyzing social media data
in order to obtain useful insights. It covers areas such as sentiment analysis (Li and Wu 2010), sentiment
classification, social network analysis (Magnusson 2012), and data mining (Han et al. 2011a). These analytical
frameworks rest on a multitude of associated techniques such as text mining, computational linguistics, machine
learning, and natural language processing.
Social media websites are typical examples of Big Data sources, characterized by an exponential growth of
heterogeneous data (videos, photographs, texts, audio), which makes social media analytics a challenging task, owing
mainly to the inefficiency of traditional techniques in analyzing this huge flow of social data (Orgaz et al. 2016).
This raises the need for new technologies that boost the performance of traditional techniques and yield reliable
analysis results (Peng et al. 2017; Sapountzi and Kostas 2016). These technologies are basically Big Data technologies
which, once combined with conventional social media analytics, show a remarkable potential for processing the flood of
social media data. Hence the emergence of a newly coined term, Social Big Data (SBD) or Big Social Data (BSD), to
characterize this new area of research (Bohlouli et al. 2015; He et al. 2017; Orgaz et al. 2016). SBD designates the
combination of Big Data technologies and frameworks with the traditional analysis techniques in order to process and
analyze social media data and derive useful value. For their part, Nguyen et al. (2014) and Vatrapu et al. (2016)
define SBD as the generated social media data themselves, characterized by a massive volume of unstructured data.
It is noteworthy, however, that the alliance between social media analytics and Big Data does not seem to be
explicitly discussed. Indeed, an examination of the relevant surveys published between 2008 and 2017 and retrieved
with the terms social media analytics review and Big Data analytics review reveals that these surveys predominantly
focus on reviewing the software tools applied to social media scraping and analytics (Batrinca and Treleaven 2015),
highlighting the techniques for social media modeling and analysis (Jure 2011), outlining the different tools,
methods, and techniques useful for analyzing Big Data (Elgendy and Elragal 2014), treating the methods for analyzing
social media messages (Imran et al. 2015), and investigating the state-of-the-art techniques for analyzing social
media data (Wu et al. 2016).
Worth citing, in this respect, are the work of Orgaz et al. (2016) and the literature analysis conducted by Stieglitz
et al. (2018). The former is centered on reviewing the MapReduce methodology, the frameworks that implement it (e.g.,
Hadoop, Spark), as well as social media analytics methods and algorithms (e.g., community detection, text analytics)
designed to handle the flow of Big Social Data, along with the applications of social data analysis. The present
research, in contrast, encompasses not just a review of the methodologies, but also their categorization in terms of
their function in each of the social data processing steps. We also provide an application illustrating the different
steps involved in the process, implemented with Big Data technologies. The second work, by Stieglitz et al. (2018),
takes the form of a structured literature analysis. The authors define the scope of social media analytics and
highlight its application in several domains (politics, communication, business, etc.). They also briefly enumerate
the challenges arising from the fact that social media data share the characteristics of Big Data (volume, variety,
velocity, and veracity), while focusing exclusively on the challenges that arise during the first three steps of the
analytics (discovery, collection, and preparation). Furthermore, Stieglitz et al. (2018) categorize the articles
according to the step-specific challenges they address and report the solutions mentioned in the state of the art.
The present work differs with regard to three
major points. First, an attempt is made to discuss thoroughly how social media data actually exhibit the challenging
aspects of Big Data. It is noteworthy that identifying a single Big Data V in a dataset is not a sufficient condition
for concluding that one is facing a Big Data processing problem. Our contribution is distinguished by explaining how
the aspects of Big Data are embodied within the social data domain. The aim is to help the users of social media
analytics determine whether or not they are actually faced with a Big Data problem in their analysis. Second, the
solutions proposed in the present research go further by outlining the challenges related to each single step. A
detailed description of the Big Data solutions suited to each type of challenge is given, together with a comparison
between them. Finally, the newly advanced framework differs in terms of the steps to follow in order to analyze big
social media data, along with the techniques and technologies applicable to each of them.
Most of the existing research works involve isolated case studies stressing the challenges researchers often encounter
when deploying specific methods to analyze social media data, such as social network analysis or opinion mining. Other
researchers deal with the difficulties associated with particular related problems (Imran et al. 2015). There are no
set standards or guidelines that researchers can follow while conducting the analysis in order to identify the
difficulties likely to arise during the analysis process and how Big Data could be used in it.
In this respect, the present work is intended to provide an explicit description of how social media analytics behaves
within a Big Data context. To obtain a comprehensive view of the combined scheme integrating social media analytics and
Big Data technologies, we have formulated two main questions that the survey tries to answer:
What type of challenges do researchers encounter when analyzing big social data?
How can Big Data technologies be effectively integrated so as to deal with such challenges?
Figure 1 illustrates the motivation behind the present work along with the research questions.
The remainder of this paper is structured as follows. First, the terminology used in the paper is presented. Second,
the methodology, a systematic review, is described along with the results achieved (Sect. 3). Sect. 4 discusses the
major findings, highlighting the combined social media analytics process and Big Data architecture along with a
presentation of the features of Big Data technologies. Finally, new potential research directions and perspectives are
highlighted.
2 Background
This section outlines the major terms used in this survey, along with their respective definitions.
2.1 Big Data
The International Data Corporation (IDC) reported that as much as 1.8 ZB of data had been created by the end of 2011
and predicted that no less than 2.8 ZB would be produced in the following few years (2016), while 40 ZB of data would
be generated by enterprises by the year 2020. This flood of data is subsumed under the term “Big Data,” which is
defined by answering two major questions (Gandomi and Haider 2015): What is Big Data? What tasks does it perform? In
response to the first, Big Data denotes the explosion of various data sources, such as social media and mobile
devices. In this respect, the McKinsey report states that Big Data is about “datasets whose size is beyond the ability
of typical database software tools to capture, store, manage, and analyze” (Manyika et al. 2011). In turn, Oussous
et al. (2017) define the term as “the large growing data sets that include heterogeneous formats: structured,
unstructured and semi-structured data.” Big Data is also regarded as the set of technologies useful for managing a
huge amount of data that cannot be managed through traditional technologies. Similarly, the IDC describes Big Data as
a new generation of software and architectures designed to economically extract value from very large volumes of a
wide variety of data by enabling high-velocity capture, discovery, and/or analysis.
Whichever of the two questions these definitions try to answer (what is Big Data? what tasks does it perform?), taken
together they present the term Big Data as a standard notion describing both the data, through well-defined
characteristics such as volume and variety, and the technologies involved in treating these enormous amounts of
generated data. Figure 2 illustrates how often the term Big Data has been searched since 2004, relative to the total
search volume across various world
regions. The term has become widespread over the last few years, starting from 2011. The figure also shows that the
search rate for the term reached its peak over the last 5 years, having multiplied by eight. Gandomi and Haider (2015)
report a study based on the ProQuest Research Library of the frequency distribution of documents containing the term
Big Data: the count was about 380 by the beginning of 2011 and rose markedly in 2013 to reach a monthly average of
1800.
2.2 The related Big Data Vs
The first idea behind the term Big Data is volume, denoting the data size or amount (Newman et al. 2016). It is
noteworthy, however, that the state of the art highlights a variety of dimensions characterizing these data and their
applications, emphasizing the added value they provide. The three major dimensions on which authors agree, termed the
three Vs of Big Data, are volume, variety, and velocity. It was Laney who initially described Big Data through these
three Vs (Uddin et al. 2014).
Fig. 1 Illustration of the context and the research questions (diagram labels: Social media analytics; Big Data
technologies; Research questions; Extracted knowledge: business, health, politics; Input; Processing; Output)
Fig. 2 Evolution of the search rate of the term Big Data from 2004 until 2017 (made by Google Trends)
Volume represents the scale of generated data, which goes beyond terabytes to reach petabytes and even exabytes. Data
are continuously generated from a multiplicity of sources such as social media, cloud-based services (e.g., Amazon),
enterprise data, and the Internet of Things (IoT) (Khan et al. 2014; Storey and Song 2017). Radicati and Hoang (2011)
estimate that the number of e-mail accounts worldwide will increase from 3.3 billion in 2012 to over 4.3 billion by
late 2016. The survey conducted by IBM in mid-2012 reveals that a data amount exceeding one terabyte is regarded as Big
Data (Schroeck et al. 2012). This threshold is relative (Gandomi and Haider 2015), as the quantification of data volume
also depends on other factors such as time and data type. Concerning the time factor, storage capacities will
increase, allowing the management of bigger datasets. As for the data-type factor, one terabyte of textual data is not
necessarily equivalent to one terabyte of video data. Hence, Big Data is not just about volume; it includes other
dimensions beginning with the letter V, culminating in the “Vs” of Big Data.
Variety describes the various data sources and types (Chen and Zhang 2014). Data stemming from different sources come
in different formats. For instance, one can distinguish structured data, which are often managed with Structured Query
Language (SQL), a programming language created for managing and querying data within Relational Database Management
Systems (RDBMS) (Hashem et al. 2015). Structured data are easy to input, query, and store. There are also data
generated in a semi-structured format, such as Extensible Markup Language (XML) and JavaScript Object Notation (JSON)
data. Yet the main format characterizing Big Data is unstructured data, such as multimedia data (videos, photographs,
and audio), which do not follow a fixed format (Gandomi and Haider 2015); this makes their management a serious
challenge facing data scientists.
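The three format families described above can be contrasted with a small sketch. It is only an illustration under assumed toy data: the table name, the field names, and the sample post are hypothetical and do not come from any of the surveyed systems.

```python
import json
import sqlite3

# Structured: rows follow a fixed schema and are queried through SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, followers INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [(1, "alice", 120), (2, "bob", 45)])
popular = conn.execute("SELECT name FROM users WHERE followers > 100").fetchall()

# Semi-structured: a JSON document carries its own, flexible field names.
raw_post = '{"user": "alice", "text": "Loving the new phone!", "tags": ["tech"]}'
post = json.loads(raw_post)
tags = post.get("tags", [])  # optional field: may be absent in other records

# Unstructured: free text has no fields at all; any structure must be derived,
# here with a deliberately naive whitespace tokenizer.
free_text = "Loving the new phone! Battery life could be better though."
tokens = free_text.lower().split()
```

The point of the sketch is the growing amount of work needed before analysis can start: a query for structured data, a parse plus defensive field access for semi-structured data, and a derivation step for unstructured data.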
Velocity refers to the speed of incoming and outgoing data (Chen and Zhang 2014). The speed at which data are generated
is measured on a scale going from batch through near real time and real time to streaming. According to Yaqoob et al.
(2016), data velocity depends highly on the proliferation of mobile devices and other sensor devices connected to the
Internet. Additionally, providing reasonable response times and updates is a requirement and a benchmark against which
the efficiency of applications can be assessed, as confirmed by Fan and Bifet (2013). Managing and analyzing streaming
data is a further challenge that requires relevant techniques and technologies (Orgaz et al. 2016). Other authors and
companies attribute further dimensions to Big Data. For instance, IBM and Microsoft “coined Veracity as the fourth V,
which represents the unreliability inherent in some sources of data as customer sentiments in social media are
uncertain in nature” (Gandomi and Haider 2015). In this respect, veracity refers to data messiness and trustworthiness
(Gandomi and Haider 2015). For Storey and Song (2017), veracity raises challenges related to data quality (Haryadi
et al. 2016), closely associated with accuracy, timeliness, currency, completeness (Agrawal et al. 2012), consistency,
and accessibility (Corbellini et al. 2017), that should be handled by means of automated techniques. In turn, McKinsey
and Oracle added the notion of value as the fourth V in defining Big Data (Chen and Zhang 2014). This value dimension
refers to the worth of the insights hidden within Big Data (Gandomi and Haider 2015). For Wang et al. (2017), Big Data
is characterized by five Vs, considering both the value and veracity aspects. Other authors, such as Uddin et al.
(2014), speak of as many as seven Vs, incorporating the validity and volatility dimensions. They define validity as
the correctness and accuracy of data with regard to the intended usage, while volatility designates the retention
policy for structured data as implemented daily in businesses. Figure 3 depicts the different dimensions associated
with Big Data and their respective designations.
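The two ends of the velocity scale mentioned above (batch and streaming) can be contrasted with a short sketch. It is a simplified illustration, not a description of any Big Data engine: the function names and the sample posts are hypothetical, and a plain Python iterator stands in for a real data stream.

```python
from collections import Counter
from typing import Iterable, Iterator

def batch_counts(posts: Iterable[str]) -> Counter:
    """Batch: the whole dataset is available before processing starts."""
    return Counter(word for post in posts for word in post.lower().split())

def streaming_counts(posts: Iterable[str]) -> Iterator[Counter]:
    """Streaming: update the state per arriving item and emit a result each time."""
    running = Counter()
    for post in posts:
        running.update(post.lower().split())
        yield running.copy()  # snapshot available without waiting for the end

posts = ["big data", "social data", "big social data"]
final_batch = batch_counts(posts)
snapshots = list(streaming_counts(iter(posts)))
```

Both approaches end with the same counts; the difference is that the streaming version produces an up-to-date answer after every incoming post, which is what the response-time requirement discussed above calls for.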
2.3 Social media analytics
Stieglitz et al. (2014) define social media analytics as “an emerging interdisciplinary research field that aims on
combining, extending, and adapting methods for analysis of social media data.” Another definition, introduced by Zeng
et al. (2010), considers social media analytics as tools and frameworks “to collect, monitor, analyze, summarize, and
visualize social media data, usually driven by specific requirements from a target application.”
3 Research design
For an effective social media analytics process to take place under a Big Data architecture, one has first to
recognize the reasons that make Big Data technologies a necessary condition for extracting
knowledge from social data. In this regard, it seems appropriate to review the challenges of social media analytics
reported in the state of the art before discussing them. This section is structured as follows. First, we describe the
methods used to conduct this study. Second, the major challenges drawn from the examined works are highlighted, mainly
the lack of predefined steps for processing social media analytics and the Big Data dimensions characterizing social
data. Finally, some relevant solutions are proposed for each cited challenge.
3.1 The applied methodology
In this paper, a systematic review is conducted. The relevant data are collected manually by searching the available
electronic databases with a set of predefined terms, followed by a skim of the title and abstract of each retrieved
work to determine its relevance. The applied methodology is further detailed in the following subsections.
3.1.1 Databases andterms
Our search method applied is basically database-oriented
and accounts for the entirety of published journal arti-
cles and conferences as extracted from four main biblio-
graphic databases dealing with the computer science-asso-
ciated areas, more particularly, the ACM (Association for
Computing Machinery), IEEE (Institute of Electrical and
Electronics Engineers) Xplore, Springer, and Elsevier. The
search focused on retrieving examples of social media ana-
lytics frameworks and identifying the perceived challenges
related to the Big Data aspect of social media data: the vol-
ume, variety, and velocity. To this end, a number of terms
have been searched: “social media analytics,” “social media
analysis,” “social data analysis” as synonyms describing the
same search area. These terms are jointly combined with
the terminologies “Big data challenges “and “Frameworks”
(Table1).
3.1.2 Inclusion andexclusion rules
The criteria applied to identify the convenient studies sat-
isfying the research question requirement among those col-
lected from the electronic databases are mainly:
The paper should be published between the year 2008
and the year 2018.
Fig. 3 Big Data dimensions
01 Volume: refers to the size of the data (C. P. Chen and Zhang 2014)
02 Variety: describes the sources and types of data (C. P. Chen and Zhang 2014)
03 Velocity: refers to the speed of incoming and outgoing data (C. P. Chen and Zhang 2014)
04 Variability: refers to the variation in the data flow rates (Gandomi and Haider 2015)
05 Value: refers to the worth of hidden insights inside big data (Gandomi and Haider 2015)
06 Veracity: refers to the messiness and trustworthiness of data (Gandomi and Haider 2015)
07 Validity: refers to the correctness and accuracy of data with regard to the intended usage (Khan et al. 2014)
08 Volatility: refers to the retention policy of structured data that we implement every day in our businesses (Khan
et al. 2014)
Table 1 Used search terms and electronic databases
Search terms: (Social media analytics OR Social media analysis OR Social data analysis) AND (Frameworks; Big Data
challenges)
Electronic databases: ACM; IEEE Xplore; Springer; Elsevier
The paper should be written in English.
The paper should involve processing steps related to the social media analytics task.
The paper should discuss the challenges of social media analytics on the basis of the four Vs of Big Data (volume,
variety, velocity, and veracity). Only these four Vs are considered, for the following reasons. The value dimension,
as described by Sapountzi and Kostas (2016), is the process of extracting insights and useful information from data.
In addition, Zeng et al. (2010) describe different techniques for extracting value from data, such as machine
learning, data mining, statistics, optimization, and decision support analysis. This dimension could therefore be seen
as a prerequisite for the processing of Big Data rather than as a dimension of Big Data. As for the volatility and
validity dimensions, they depend highly on the application domain of the social data.
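The inclusion rules listed above can be expressed as a simple screening filter. The sketch below is illustrative only: the `Paper` record and its field names are hypothetical, and the handling of the two content rules is an assumption. By default either content rule is enough, and `require_both=True` gives the stricter reading in which a paper must satisfy both.

```python
from dataclasses import dataclass, field

FOUR_VS = {"volume", "variety", "velocity", "veracity"}

@dataclass
class Paper:
    title: str
    year: int
    language: str
    describes_processing_steps: bool = False
    vs_challenges: set = field(default_factory=set)  # Vs the paper discusses

def is_included(paper: Paper, require_both: bool = False) -> bool:
    in_period = 2008 <= paper.year <= 2018
    in_english = paper.language.lower() == "english"
    has_steps = paper.describes_processing_steps
    has_vs = bool(paper.vs_challenges & FOUR_VS)  # value, validity, etc. do not count
    content_ok = (has_steps and has_vs) if require_both else (has_steps or has_vs)
    return in_period and in_english and content_ok

# Hypothetical candidate records, for illustration only.
candidates = [
    Paper("A", 2014, "English", describes_processing_steps=True),
    Paper("B", 2006, "English", vs_challenges={"volume"}),
    Paper("C", 2016, "French", describes_processing_steps=True),
    Paper("D", 2017, "English", vs_challenges={"value"}),
    Paper("E", 2018, "English", True, {"velocity", "veracity"}),
]
selected = [p.title for p in candidates if is_included(p)]
```

In practice the screening in this survey was done manually on titles and abstracts; a filter of this kind would only help once those judgments have been recorded as structured fields.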
3.2 Results overview
The query built from the above search terms and their combinations was run on the four stated electronic databases
and, after application of the inclusion and exclusion criteria, yielded the relevant articles. These articles fall
into three main categories. First, papers that discuss a social media analytics process by describing the different
steps of the analysis. Second, papers that discuss the social media analytics challenges due to the Big Data
dimensions characterizing the analyzed social data and propose some solutions. Finally, papers that present social
media analytics frameworks based on a Big Data architecture. Table 2 sums up the search results for each of the
databases.
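The figures reported in Table 2 can be aggregated with a few lines of arithmetic. The sketch below uses only the counts given in the table; the totals assume that no paper is counted in more than one database, which the paper does not state.

```python
# (papers identified in search, papers meeting inclusion criteria), from Table 2
TABLE_2 = {
    "ACM digital library": (45, 20),
    "Springer": (10, 4),
    "IEEE Xplore": (64, 31),
    "Elsevier": (36, 6),
}

total_identified = sum(found for found, _ in TABLE_2.values())
total_included = sum(kept for _, kept in TABLE_2.values())

# Share of identified papers that pass the inclusion criteria, per database.
retention = {db: kept / found for db, (found, kept) in TABLE_2.items()}
```

Under that assumption, 61 of the 155 identified papers are retained, with IEEE Xplore showing the highest retention rate and Elsevier the lowest.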
3.2.1 Lack ofpredened step foranalyzing social media
data
Among the relevant papers collected before, this subsec-
tion focuses on the set of papers that present social media
analytics frameworks and describe the followed steps to
extract useful knowledge from social data.
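The frameworks reviewed in this subsection share a common processing logic: collect, cleanse, store, analyze, and visualize. A minimal sketch of that sequence is given here. Every function is a hypothetical stand-in: the hard-coded posts replace a social media API, a Python list replaces the storage layer (the surveyed healthcare frameworks use, for example, an RDF database), and a word count replaces the actual analysis methods.

```python
import re
from collections import Counter

def collect():
    """Stand-in for a social media API call; returns hard-coded posts."""
    return ["Flu season is BAD this year!!", "",
            "flu shots available  downtown", "Flu season is BAD this year!!"]

def cleanse(posts):
    """Lowercase, strip punctuation, and drop empty or duplicate posts."""
    seen, cleaned = set(), []
    for post in posts:
        text = re.sub(r"[^a-z0-9\s]", "", post.lower())
        text = " ".join(text.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

def store(posts, db):
    """Stand-in for a database write; appends to an in-memory list."""
    db.extend(posts)
    return db

def analyze(db):
    """Toy analysis: word frequencies over the stored posts."""
    return Counter(word for post in db for word in post.split())

def visualize(counts, top=3):
    """Toy visualization: a text bar chart of the most frequent words."""
    return "\n".join(f"{w:<10} {'#' * n}" for w, n in counts.most_common(top))

database = []
report = visualize(analyze(store(cleanse(collect()), database)))
```

Keeping each step as a separate function mirrors the way the surveyed frameworks swap technologies per step while leaving the overall sequence unchanged.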
3.2.1.1 Findings The selected papers present a set of frameworks in several domains: health (Abbasi et al. 2014;
Dredze 2012), emergency situations (Avvenuti et al. 2016), business (Wang et al. 2016), etc. For each domain, we
investigate the presented frameworks and summarize the input, the output, and the steps followed in each framework.
The results are summarized in Table 3, which lists several frameworks categorized according to their application
fields. It includes major frameworks pertaining to the political context (Skoric et al. 2012; Stieglitz and Dang-Xuan
2013; Yaqub et al. 2017), in which the authors collect data from social media websites (e.g., Twitter) and analyze
them in order to investigate user behavior before and after the US election case. Stieglitz et al. (2018) extend the
framework introduced in Stieglitz and Dang-Xuan (2013) by adding the challenges arising in each step of the framework.
These works also examine the nature of people's discussions and sentiment regarding the politicians concerned, and the
authors apply social media data in a bid to predict the election results. The frameworks rest on several techniques
such as opinion mining and sentiment analysis (Stieglitz and Dang-Xuan 2013), while the results of the analysis are
reported using dashboards and curves. Table 3 also illustrates some frameworks pertaining to the healthcare domain (Ji
et al. 2017; Kanhabua et al. 2012a). Both frameworks rely on the analysis of healthcare-related social media data, and
each addresses a specific subject. Ji et al. (2017) focus on using social media data to search for information related
to a specific disease. Their framework is intended for patients and doctors alike, as it enables them to search for
information related to symptoms and medicines. The framework introduced by Kanhabua et al. (2012a) aims to track the
temporal development of outbreak mentions on Twitter, as a tool for detecting early warnings so that health
authorities can respond rapidly. Even though both frameworks rest on analyzing Twitter data, the steps of analysis
differ in terms of the technologies applied for storage (RDF database) and visualization, while the processing logic
follows the same analysis steps: collect, cleanse, store, analyze, and visualize the analysis results. Other
frameworks aimed at predicting natural disasters are also introduced in this table (Avvenuti et al. 2014; Sakaki
et al. 2013; Win and Aung 2017), based on analyzing tweets extracted from the Twitter microblogging website. The
relevant data are collected, filtered, and then classified. These frameworks display a variety of classification
methods and techniques. In
Table 2 Number of research papers per database
Electronic database | Number of papers identified in search | Number of papers meeting inclusion criteria
ACM digital library | 45 | 20
Springer | 10 | 4
IEEE Xplore | 64 | 31
Elsevier | 36 | 6
Social Network Analysis and Mining (2018) 8:30
1 3
30 Page 8 of 28
Table 3 Social media analytics-based frameworks

Field: Politics
Ref: (Stieglitz and Dang-Xuan 2013)
Input: (1) Status updates and corresponding comments from public Facebook profiles and pages: Facebook Graph API; (2) public tweets: Twitter Search API and Twitter Streaming API; (3) blog messages from Web blogs: RSS feeds and HTML parsing
Analysis steps: (1) Data tracking and monitoring: refers to the choice of the appropriate data tracking sources and of the methods used to extract data (keyword/topic-based, actor-based approach, etc.); (2) Data preprocessing: prepare textual data by eliminating stop words, stemming, and lemmatization; (3) Data analysis: refers to social network analysis, opinion mining, sentiment analysis, text analysis, etc.
Output: Visual representation of the analysis; dashboard and reports

Field: Politics
Ref: (Yaqub et al. 2017)
Input: Tweets: the streaming API
Analysis steps: (1) Collect tweets using keywords; (2) Data cleaning and extraction; (3) Sentiment tagging and classification of the gathered tweets; (4) Importing data into the MySQL database to perform exploration; (5) Development of a user behavioral model, formulating hypotheses and deriving findings approving or disapproving the hypotheses for the analysis of data
Output: Diagrams that show the frequency and quantify sentiment based on extracted tweets

Field: Politics
Ref: (Skoric et al. 2012)
Input: Tweets published by a set of selected Singapore-based Twitter users
Analysis steps: (1) Data collection (tweets using the Twitter REST API); (2) Data storage: MySQL; (3) Data measures
Output: Curve drawing to show the correlation between the tweets and votes

Field: Health
Ref: (Ji et al. 2017)
Input: Twitter and health care websites
Analysis steps: (1) Data collection and filtering; (2) Data storage in an RDF database; (3) Developing analytic services such as recommendation and statistical services; (4) Visualizing the result for users (patients and clinics)
Output: User dashboard that shows information (e.g., drugs and conditions) concerning a target disease

Field: Health
Ref: (Kanhabua et al. 2012a)
Input: (1) 3000 official outbreak reports publicly available from external resources (WHO and ProMED-mail); (2) Twitter collection consisting of over 112 million tweets
Analysis steps: (1) Data collection; (2) Tweet processing (filtering the relevant tweets related to the outbreak); (3) Data analysis: drawing the evolution of the outbreak-related tweets over time
Output: User dashboard that visualizes the temporal development of an outbreak; the target place is Bangladesh
Table 3 (continued)

Field: Other (Urban)
Ref: (Bocconi et al. 2015)
Input: Social media (e.g., Twitter, Instagram, Foursquare), mobile phone data, spatial statistics, and demographics
Analysis steps: (1) Ingestion and analysis: responsible for acquiring, cleansing, and analysis of social data; (2) Fusion tier: caters for the integration interoperability issues across different data sources and usage domains; (3) Exploration and visualization: user interfaces for data exploration, comparison, and urban analytics
Output: Map-based visualizations that show clustered points, choropleth, and path

Field: Other (Business)
Ref: (Oh et al. 2015)
Input: Tweets
Analysis steps: (1) Capturing social media data through the identification of relevant keywords, the data extraction based on these keywords (tweets related to the keywords), and the use of the Twitter Search API; (2) Preprocessing the extracted data, which refers to data filtering using text mining techniques (“remove irrelevant tweets or to assign tweets such as in case of tweets belonging to more than one ad or brand”); (3) Understanding data by defining the relevant measures and analyzing the data (sentiment analysis techniques, text mining, etc.)
Output: Presentation: summarizing and reporting the analysis results

Field: Other (Marketing)
Ref: (Bothos et al. 2010)
Input: Tweets
Analysis steps: (1) Content extraction from social media by customized parsers for each source (Flixster.com, IMDb.com, Twitter, etc.); (2) Execute Web queries; (3) Execute targeted queries at the microblogging application Twitter; (4) Processing social information by processing of rating sentiment analysis, query analysis, market-based collective intelligence with artificial agents
Output: Reports

Field: Other (Content analysis, DWFP)
Ref: (Dang et al. 2014)
Input: Tweets
Analysis steps: (1) Data integration through data parsing and collection; (2) Storing the collected data in a unified database; (3) Developing search support using keyword-based functions; (4) Developing the multilingual translation support function through the use of the Google Translation API
Output: Visualizing the search results using a unified interface design for users through application of Java Server Pages (JSP) technology
Table 3 (continued)

Field: General purpose
Ref: (Stieglitz et al. 2018)
Input: Data based on the research domain (e.g., marketing data, political data, business data, etc.)
Analysis steps: (1) Data tracking and storage: tracking based on a set of approaches (e.g., keyword, actor, and URL related) and methods (e.g., API, RSS/HTML parsing), storage in databases; (2) Data preparation; (3) Data analysis based on a set of approaches (e.g., structural attribute, topic/trend related) and methods (e.g., social network analysis, content analysis, etc.)
Output: Reports

Field: Disaster
Ref: (Win and Aung 2017)
Input: Tweets
Analysis steps: (1) Tweets collection: Twitter Streaming API; (2) Tweets preprocessing: reduce the redundancy and noise; (3) Feature extraction: detection of linguistic features such as word N-grams, POS features, and sentiment lexicon features using the NRC Hashtag Sentiment Lexicon; (4) Creating disaster lexicons from annotated tweets; (5) Classification of tweets: LibLinear classifier
Output: The corpus for the searched disaster contains the target relevant tweets

Field: Disaster
Ref: (Sakaki et al. 2013)
Input: Tweets including keywords related to a target event
Analysis steps: (1) Crawl tweets: Twitter Search API; (2) Classifying tweets into positive and negative tweets, where positive means that a tweet truly refers to an actual contemporaneous earthquake occurrence; (3) Event detector; (4) Location estimation
Output: Visualization of the earthquake location estimated from tweets on a geographic map

Field: Disaster
Ref: (Avvenuti et al. 2014)
Input: Tweets from Twitter
Analysis steps: (1) Data acquisition: collect data based on keywords; (2) Data filtering: filter the noise from the collected data; (3) Event detection
Output: Web application designed to show temporal, geographical, and content analyses at both the event and message level
addition, the table contains frameworks devoted to other
domains, such as business (Oh et al. 2015), marketing
(Bothos et al. 2010), and urban development (Bocconi et al. 2015).
3.2.1.2 Discussion: a Big Data pipeline for encapsulating
social media analytics The reached result appears to reveal
well that despite the variety of these frameworks-associated
applications fields, they share some points in common such
as the input which is extracted from the social media via
their available APIs (e.g., Twitter Rest API, Facebook Graph
API, etc.). The same applies to the output which refers to the
value extracted from the analyzed social data along with the
visual representation of their analysis (e.g., report, graphs;
maps). Noteworthy, however, and as Table3 analysis steps
indicate, is that each of these frameworks proves to undertake
specific steps to extract value from social data. A number of
studies maintain the absence of clear processes whereby the
steps could be defined to derive and extract useful informa-
tion from social data, as documented by Peng etal. (2017),
who state the lack of standardization with regard to the pro-
cessing of social networking data. Nevertheless, following
the emergence of social data analytics, it has become crucial
to identify the involved steps necessary for constructing a
clear view for companies and researchers as to how these
data could be managed (Cambria et al. 2014). The state-
of-the art framework appears to reveal the persistence of
an inconsistency prevailing among/between the research
works with respect to the steps to address during the social
media analytics process. As far as Stieglitz etal. (2018) are
concerned, a sample of social media analytics framework
has been proposed, whereby four main steps are defined,
namely: the discovery, tracking, preparation, and analysis
steps. For each step to be well determined, the authors set
a selection of convenient methods and approaches. Further-
more, their devised framework undertakes to provide infor-
mation about the challenges likely to emerge with respect
to each step. Based on the approaches and methods already
outlined throughout the present survey, along with those
discussed by Stieglitz and Dang-Xuan (2013), as well as the
Big Data aspects characterizing social data, we propose to
put forward a new social media analytics relating scheme.
As already stated, the newly advanced social media relating
analytical frameworks turn out to differ in terms of the steps,
as well as techniques and technologies liable for implemen-
tation with respect to each of them. Concerning the relevant
techniques and technologies relative to each specific step,
they make subject of a full section (Sect.4), while the sug-
gested steps’ identification is discussed in the section below.
Indeed, as social media are Big Data sources, it is reasonable
to adopt the Big Data processing pipeline as the predefined
steps for encapsulating the social media analytics process.
In this respect, Furht and Villanustre (2016) designed a workflow that highlights the different steps involved in analyzing Big Data. This workflow outlines six steps: data collection, which refers to extracting data from different sources and in different formats (structured and unstructured data); ingestion, which refers to “loading vast amounts of data onto a single data store”; discovery and cleansing, which cover “understanding format and content; clean up and formatting”; integration, which designates the “linking, entity extraction, entity resolution, indexing and data fusion”; analysis, which applies techniques such as intelligence, statistics, predictive and text analytics, and machine learning to Big Data; and delivery, which covers querying, visualization, and real-time delivery of the analysis results for enterprises. Similarly, Agrawal et al. (2015) present a Big Data processing pipeline comprising the steps required to process Big Data sources. The authors consider five distinct steps necessary for processing Big Data, namely:
Acquisition and recording: This step describes the process of collecting and storing data from different sources.
Information extraction and cleaning: This step aims to prepare and cleanse data for the processing step.
Data integration, aggregation, and representation: This step deals with putting data into formats suitable for the analysis.
Query processing, data modeling, and analysis: These procedures concern the querying and data mining processes, which analyze data through Big Data analytics techniques.
Big Data interpretation: This step consists in understanding the analysis and taking the right decisions regarding a particular problem, given the complexity of the data.
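The five steps listed above can be pictured as a chain of functions, each consuming the output of the previous one. The following Python sketch is illustrative only: the sample posts, the function names, and the tiny sentiment word lists are assumptions made for this example and are not taken from Agrawal et al. (2015).

```python
import re
from collections import Counter

# Hypothetical raw posts standing in for data returned by a social media API.
RAW_POSTS = [
    {"id": 1, "text": "Great debate tonight! http://example.com/a"},
    {"id": 2, "text": "   "},
    {"id": 3, "text": "Terrible answer on health care"},
    {"id": 1, "text": "Great debate tonight! http://example.com/a"},  # duplicate
]

# Toy word lists (assumption): a real system would use a trained model or lexicon.
POSITIVE = {"great", "good"}
NEGATIVE = {"terrible", "bad"}


def acquire(source):
    """Step 1, acquisition and recording: copy records from the source."""
    return [dict(post) for post in source]


def extract_and_clean(posts):
    """Step 2, extraction and cleaning: drop empty and duplicate posts, strip URLs."""
    seen, cleaned = set(), []
    for post in posts:
        text = re.sub(r"https?://\S+", "", post["text"]).strip()
        if text and post["id"] not in seen:
            seen.add(post["id"])
            cleaned.append({"id": post["id"], "text": text})
    return cleaned


def integrate(posts):
    """Step 3, integration and representation: tokenize into a uniform format."""
    return [
        {"id": p["id"], "tokens": re.findall(r"[a-z']+", p["text"].lower())}
        for p in posts
    ]


def analyze(records):
    """Step 4, modeling and analysis: label each record with a toy sentiment."""
    labels = Counter()
    for record in records:
        score = sum((t in POSITIVE) - (t in NEGATIVE) for t in record["tokens"])
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        labels[label] += 1
    return labels


def interpret(labels):
    """Step 5, interpretation: turn the counts into shares per label."""
    total = sum(labels.values())
    return {label: count / total for label, count in labels.items()} if total else {}


def run_pipeline(source):
    """Run the five steps in order on the given source."""
    return interpret(analyze(integrate(extract_and_clean(acquire(source)))))


if __name__ == "__main__":
    print(run_pipeline(RAW_POSTS))
```

Keeping each step as a separate function mirrors the pipeline view: a step can be swapped (for instance, a different cleaning rule) without touching the others.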
In the same vein, Gandomi and Haider (2015) regroup these steps into two major categories. The first category, Big Data management, involves the processes and supporting technologies used to acquire, store, prepare, and retrieve the data to be analyzed. It involves three main steps: acquisition and recording; information extraction and cleansing; and data integration, aggregation, and representation. In this respect, Big Data management, as defined by Siddiqa et al. (2016), stands as a new discipline based on the application of “data management techniques, tools, and platforms as storage, preprocessing, processing and security” and serves “to enhance data quality and accessibility for decision-making.”
In fact, the authors advance a Big Data management process flow and taxonomy that help to describe the different activities involved in extracting decisional information from Big Data. It includes several activities, namely data collection, storage, and preprocessing, which prepares the collected data for analysis by means of techniques and algorithms such as data cleansing (Kumar and Chadha 2012) and transmission (Siddiqa et al. 2016), followed by processing and analysis via the two data mining methods of classification and prediction. The second category, Big Data analytics, consists in applying techniques for analyzing and acquiring intelligence from Big Data. It includes the querying process, data modeling, and analysis, as well as the interpretation step. This encapsulation offers companies and researchers working on social media data analytics a clear view of the social data processing pipeline. It concerns only the selection of the appropriate techniques and methods for achieving their goals. However, each step of the pipeline exposes challenges that are basically related to the data dimensions. Several discussions address this issue. Agrawal et al. (2012) and Chen and Zhang (2014) mention that the pipeline challenges consist in heterogeneity and incompleteness, scale, timeliness, privacy, and human collaboration, where scale refers to the huge volume of data, which raises storage and processing problems, and timeliness relates to the speed of incoming data and the response time. Siddiqa et al. (2016) also discuss parameters that should be handled during Big Data management, such as availability of the system for users at any time, scalability, data integrity, heterogeneity, resource optimization, and velocity, while Olshannikova et al. (2016) classify the challenges arising during Big Data processing into:
Data challenges: related to the characteristics of the data, such as volume, variety, velocity, veracity, data quality, data availability, and scalability.
Processing challenges: related to the methods used to capture, transform, model, etc., the data.
Management challenges: related to the privacy and security of data during the processing steps.
Figure4 depicts the Big Data pipeline as used to
encapsulate the social media analytics process and the
challenges related to each step. In the upcoming subsec-
tion, these challenges are investigated along with their
requirements.
3.2.2 Big Data dimensions characterizing social data
Each step of the previously discussed pipeline exposes several challenges related to the Big Data aspects characterizing social data. In this subsection, we therefore examine the remaining set of selected papers, which show how social media data exemplify the Big Data dimensions. The challenges arising from these dimensions are also addressed, along with the solutions proposed by the state-of-the-art frameworks studied.
3.2.2.1 Findings Volume Online social media websites offer a diversity of free, easy-to-use, public services, which makes them available to all Web users without restrictions or costs. They are characterized by a great number of active users. Twitter has reached about 330 million monthly active users (footnote 2), while YouTube has recorded over 1 billion users (footnote 3), and Facebook announced that it had reached about 2 billion
Fig. 4 Big Data pipeline used to encapsulate the social media analytics process. Labels recoverable from the figure: Social Big Data analytics pipeline; Social Big Data management; Social Big Data analysis; data collection; data storage; data preprocessing; data processing; data analysis; data interpretation; challenges; data fidelity; privacy; security; data quality; data streaming; real-time response; heterogeneity; data visualization; scalability; availability; integrity.
2 https://blog.hootsuite.com/twitter-statistics/.
3 https://www.youtube.com/intl/eng/yt/about/press/.
monthly active users in 2017 (footnote 4). Indeed, every 20 min, about 2 million friend requests and 3 million messages are sent and 1 million links are shared (footnote 5) on Facebook. Moreover, no fewer than 3600 photographs are shared every minute on Instagram (footnote 6), while 300 h of new video are uploaded every minute to YouTube, and about 500 million tweets are posted every day on Twitter (see footnote 2). This highlights the huge amount of data generated, which corresponds to the volume dimension associated with Big Data.
Access to these social data raises several technical challenges (Jagadish et al. 2014; Reuter and Scholl 2014). It relies on querying social media platforms through their available APIs (e.g., Facebook Graph API (footnote 7), Twitter Search API (footnote 8), and YouTube Data API (footnote 9)), which impose usage limits. For example, the YouTube Data API sets a limit of 30,000 units per user per second, while the total quota per day is set at 50,000,000 units. Researchers may also collect data via other methods, such as Web crawling, which does not comply with the terms of service of most social media platforms (Imran et al. 2015). To achieve reliable results, researchers need to extract a large amount of data to serve as a sample for the analytic process. For instance, in their study of the characteristics of YouTube videos based on the 7.6 MB of crawled videos, Cheng et al. (2013) report that 900 TB of disk space is required to store about 120 million YouTube videos. Karpenko and Aarabi (2011) develop a compact representation called “tiny videos” that helps to achieve high video compression, based on 52,159 videos occupying 520 GB of disk space that were extracted using the YouTube Data API; the metadata of the videos are stored in a 2.8 MB file. In turn, Achrekar et al. (2011) develop the Social Network Enabled Flu Trends (SNEFT) framework to track and predict the emergence and spread of an influenza epidemic in a population, based on tweets and profiles extracted from Twitter between Oct 18, 2009 and Oct 31, 2010: they collected 4.7 million tweets from 1.5 million unique users, along with their social relationships on Twitter, with retweets accounting for 9.5% of the total collected tweets.
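Quota limits of the kind quoted in this paragraph for the YouTube Data API mean that a collector has to budget its requests. The sketch below is a minimal, hypothetical illustration in Python: the class name `QuotaBudget`, its fields, and the idea of passing the current second explicitly are assumptions made for this example, and the cost of a request in units is left to the caller because it depends on the API. Resetting the daily counter at the day boundary is also left to the caller.

```python
from dataclasses import dataclass


@dataclass
class QuotaBudget:
    """Tracks API usage against a daily quota and a per-second limit.

    The default limits are the figures quoted in the text for the
    YouTube Data API; everything else here is an illustrative assumption.
    """

    daily_limit: int = 50_000_000
    per_second_limit: int = 30_000
    used_today: int = 0
    used_this_second: int = 0
    current_second: int = 0

    def try_spend(self, units: int, now_second: int) -> bool:
        """Record a request costing `units` at `now_second` if it fits both limits."""
        if now_second != self.current_second:
            # A new one-second window starts: reset the per-second counter.
            self.current_second = now_second
            self.used_this_second = 0
        if self.used_this_second + units > self.per_second_limit:
            return False  # would exceed the per-second limit
        if self.used_today + units > self.daily_limit:
            return False  # would exceed the daily quota
        self.used_this_second += units
        self.used_today += units
        return True
```

A collector would call `try_spend` before each request and wait, or stop for the day, whenever it returns `False`.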
These huge amounts of data, reaching gigabytes and terabytes, together with the extraction of multimedia data in the form of videos and photographs, raise several challenges throughout the social media analytics process, mainly:
Data storage (Chen and Zhang 2014; Kaisler et al. 2013): As current disk technology is limited to about 10 terabytes per disk, 1 exabyte would require 100,000 disks, which makes it difficult to attach the number of disks required.
Data processing: Processor speed has followed Moore's Law. Yet a fundamental shift is under way: data volume is increasing at a rate that exceeds that of CPU speeds and of other computing resources (Jagadish et al. 2014).
Data visualization: The human eye has difficulty taking in a large amount of data, and computer screens offer only around 1 to 3 million pixels. These challenges are categorized by Agrawal et al. (2015) under the heading of perceptual scalability.
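Storage estimates of the kind given under the data storage challenge can be checked with a short calculation. The following Python sketch assumes decimal units (1 TB = 10^12 bytes, 1 EB = 10^18 bytes) and ignores replication and file system overhead; the function name `disks_required` is introduced here for illustration only.

```python
def disks_required(total_bytes: int, disk_bytes: int) -> int:
    """Smallest number of disks whose combined capacity covers total_bytes."""
    if disk_bytes <= 0:
        raise ValueError("disk_bytes must be positive")
    return -(-total_bytes // disk_bytes)  # ceiling division on integers


TB = 10**12  # decimal terabyte (assumption: decimal rather than binary units)
EB = 10**18  # decimal exabyte

# One exabyte stored on 10 TB disks.
print(disks_required(1 * EB, 10 * TB))

# The 900 TB estimate for YouTube videos cited earlier, on the same disks.
print(disks_required(900 * TB, 10 * TB))
```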
Variety Social media websites offer a huge amount of data, including profile data, user connections, multimedia metadata (Hiba et al. 2018), and data describing users' daily activities. People can upload videos and photographs, express opinions through textual posts, and express feelings through a diversity of actions. Online social media data are thus based on user-generated content, which Kaplan and Haenlein (2010) define as the sum of all ways in which people make use of social media. They add that the term has emerged since 2005 and describes the various forms of media content that are publicly available and created by end users. In turn, the Organization for Economic Cooperation and Development (OECD) lists different types of user-generated content (e.g., texts, photographs, audio, videos, citizen journalism, mobile content, etc.) (Vickery and Wunsch-Vincent 2007).
Therefore, social media data abound with a huge variety of data types, classified as structured data, such as profile information, and unstructured data, such as YouTube videos, Facebook posts, and Google Plus activities. Social media data stand as striking examples of unstructured data. Hence the rise of several challenges in both the data storage and data processing stages, mainly:
Data storage: Social media data include multimedia data such as photographs and videos. A photograph may have one or more colors, so there are no predefined fields characterizing photographs, which makes them unfit for storage in a relational database. Storing such data therefore requires new technologies that cope with the lack of a predefined schema.
Data processing: Fig.5 illustrates a tweet extracted from
Twitter using the Twitter4j10 library. The extracted tweet
is represented in a semistructured format (JSON) but it
does not seem liable to analysis as what really matters is
actually the ability to query the data. In such a case, the
real value of the extracted data lies in recognizing what is
4 https://newsroom.fb.com/company-info/.
5 http://www.statisticbrain.com/facebook-statistics/.
6 https://vibbi.com/buy-instagram-followers.
7 https://developers.facebook.com/docs/graph-api.
8 https://dev.twitter.com/overview/api.
9 https://developers.google.com/youtube/v3/?hl=de.
10 http://twitter4j.org/en/index.html.
actually being tweeted, that is, the value of the text attribute. This value is a textual string that contains links, text, and annotations.
Figure 6 depicts the result of a query that retrieves a video from YouTube using the YouTube Data API. The query response is in JSON format and contains the metadata of the video and the video link. The response does not reveal any information about the video content. Processing such data types therefore requires appropriate techniques for extracting useful information, and then value, from the multimedia content (text, photograph, video, and audio).
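To make the point about the text attribute concrete, the following Python sketch separates the links, annotations, and plain wording of a tweet-like JSON record. The payload is a simplified, hypothetical stand-in written for this example; it is not the tweet shown in Fig. 5, and real API responses contain many more fields.

```python
import json
import re

# Hypothetical payload: a simplified stand-in for a tweet returned as JSON.
SAMPLE_TWEET = """
{
  "id": 1,
  "created_at": "2017-07-01T10:00:00Z",
  "user": {"screen_name": "example_user"},
  "text": "Flood warning issued by @city_alerts https://example.com/x #flood"
}
"""


def split_tweet_text(raw_json: str) -> dict:
    """Separate the links, mentions, hashtags, and plain words of a tweet."""
    tweet = json.loads(raw_json)
    text = tweet["text"]
    links = re.findall(r"https?://\S+", text)
    mentions = re.findall(r"@(\w+)", text)
    hashtags = re.findall(r"#(\w+)", text)
    # Remove links and annotations to leave only the plain wording.
    plain = re.sub(r"https?://\S+|[@#]\w+", " ", text)
    plain = re.sub(r"\s+", " ", plain).strip()
    return {
        "links": links,
        "mentions": mentions,
        "hashtags": hashtags,
        "plain_text": plain,
    }


if __name__ == "__main__":
    print(split_tweet_text(SAMPLE_TWEET))
```

Regular expressions are used here only to keep the sketch short; multimedia content such as the video behind a link would still need dedicated analysis techniques.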
Velocity This dimension concerns the speed at which data are generated. Given the availability of online social media via mobile applications such as Facebook, YouTube, and Instagram, users stay connected to these applications and continuously produce data (He et al. 2017). Recently released statistics reveal that 500 million tweets are posted daily on Twitter (footnote 11) and 300 h of video are uploaded to YouTube every minute. Statistics also indicate a huge amount of data being generated on the scale of minutes and days (see footnote 3). Figure 7 illustrates the tweets generated by the CNN official page over the period between July 1, 2017 and July 20, 2017. Figure 7
Fig. 5 Tweet extracted in JSON format from the Twitter microblogging website using the Twitter4J library, showing the different formats of data (link, text, annotation) included in one textual tweet
Fig. 6 Video extracted in JSON format from the YouTube sharing website using the YouTube Data API
Fig. 7 Evolution of the total number of tweets, replies, and retweets generated by CNN and its reactors on Twitter between July 1 and July 20, 2017, obtained using the TweetStats tool
11 https://www.omnicoreagency.com/twitter-statistics/.
is generated via the TweetStats tool, and it reflects the average number of tweets generated by CNN on a daily basis, which reaches 137.9 tweets per day and about 2895 tweets per month. It also gives information about the number of retweets and replies corresponding to the generated tweets. In some cases, the analysis of social data must produce rapid responses: developers seek to minimize latency and to analyze data as they stream in, as is the case with crisis responders, who may want lower latency for a better response to a developing situation (Imran et al. 2015; Middleton et al. 2014). To this end, Twitter and YouTube offer, respectively, the Twitter Streaming APIs (footnote 12) and the YouTube Live Streaming API (footnote 13), which provide developers with low-latency access to tweet data and streamed video data. Still, analyzing such data raises major challenges during the data processing task, as it requires techniques and technologies capable of processing data in real time (Jagadish et al. 2014).
Veracity Social media data also raise the question of veracity. Being based on human-generated content, social data are often full of rumors (Ashwin et al. 2016; Mendoza et al. 2010). Such unverified information may confuse people and have a negative effect on the decision-making process (Ashwin et al. 2016). Identifying and detecting these rumors is therefore necessary during data analysis for reliable results to be achieved. Figure 8 shows the projection of the Big Data dimensions onto social media data features.
To address the above challenges, several state-of-the-art frameworks integrate Big Data technologies and provide schemes for social Big Data analysis, in which the authors describe the relevant steps along with the Big Data technologies used to implement them. One example is the platform Social Media Analysis using Big Data Technology (SoMABIT) developed by Bohlouli et al. (2015), which is divided into three major layers: a data layer describing the different sources that provide social media data; a logic layer for knowledge discovery and decision-making through the application of Big Data technologies such as Hadoop (White 2012) and Mahout (Owen and Owen 2012) and the implementation of distributed algorithms through the MapReduce paradigm (Dean and Ghemawat 2008); and an application layer
Fig. 8 Projection of the four Vs of Big Data dimensions on social media data. Labels recoverable from the figure: Big Data dimensions; projection; social media data; >> terabytes of data; text, videos, photos; streaming data; production at a scale of seconds; spread of rumors. Example figures shown: 1.4 billion daily active users on average for December 2017 (footnote 14); since 2015, more than 10,000 videos that recorded over a billion views and more than 70 million viewing hours (footnote 15); 330 million monthly active users (footnote 16).
12 https://dev.twitter.com/streaming/overview.
13 https://developers.google.com/youtube/v3/live/getting-started.
acting as an end-user interface for querying the platform and visualizing the discovered knowledge. The need to extract knowledge from social media data in order to improve organizational and corporate performance is highlighted by He et al. (2017). They also point to the limits of traditional content analysis as a motivation for systematic methods of extracting knowledge from social data. To this end, they propose a framework that rests on Big Data analytics technologies to “process Social Big Data, visualize and benchmark comparisons among competitors across events, products, issues and any other areas” and to store knowledge within a knowledge management system usable by managers and employees alike. The study conducted by Peng et al. (2017) is another contribution that strengthens the relationship between social influence analysis and Big Data.
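The MapReduce paradigm mentioned for the logic layer can be illustrated with a word count over a handful of posts. The Python sketch below runs in a single process and only mimics the three phases (map, shuffle, reduce); it is not the implementation used by SoMABIT or by Hadoop, where the same phases are distributed across machines, and all function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(post: str):
    """Map: emit a (word, 1) pair for every word of one post."""
    for word in post.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(posts):
    """Run map, shuffle, and reduce in sequence over a list of posts."""
    pairs = (pair for post in posts for pair in map_phase(post))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "social data"]))
```

Because each post is mapped independently and each key is reduced independently, both phases can be spread over many workers, which is what makes the paradigm suitable for large volumes of social data.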
In this respect, the social influence analysis process is carried out through a number of steps: data collection and storage within a cloud infrastructure; a preprocessing step intended to clean the data of irrelevant private information by means of Big Data techniques such as machine learning, data mining, and natural language processing, in order to ensure the performance of the following step; and social influence analysis itself, which includes the selection of users' influence, the performance analysis of related algorithms, the selection of evaluation metrics, and influence computing. The process culminates in the application of social influence analysis to several use cases, such as stock market prediction and personal recommendation. Table 4 summarizes the aspects of social media data and highlights the challenges arising from them during the data analytics process.
3.2.2.2 Discussion: the need for Big Data technologies The results show that the Big Data dimensions associated with social data give rise to three kinds of challenges. Material challenges concern storage and processing devices (i.e., CPU, storage disk), whose capacity or speed is limited during the storage and processing steps. Technical challenges refer to the inefficiency of traditional approaches and methods for analyzing multimedia data. Technological challenges describe the inefficiency of tools, such as visualization tools, in handling huge amounts of unstructured data. Thus, the inefficiency of data analysis methods (Orgaz et al. 2016) and traditional tools (Orgaz et al. 2016; Sapountzi and Kostas 2016) for analyzing large-scale unstructured data promotes the integration of Big Data technologies as a solution to these social media analytics challenges. Hence, novel paradigms and software (Chang et al. 2014) have been developed to handle the massive volume, variety, and velocity of the data produced and to facilitate the extraction of useful value from them for various purposes.
This has culminated in the emergence of a new research area, dubbed
Social Big Data, which combines Big Data tools and technologies with
traditional social data analytics techniques in order to strengthen
them and to ensure effective processing and analysis of the data
sourced from social media. As illustrated in Fig. 9, Orgaz et al.
(2016) consider that Social Big Data is the result of two combined
domains: Big Data and social media. They also define Social Big Data
as a concept describing the processes and methods applied to social
data that exhibit the basic Big Data dimensions, such as volume,
variety, and velocity, with the aim of extracting useful knowledge
for users and companies alike. In this respect, Cambria et al. (2014)
consider the Big Social Data analysis as “inherently
interdisciplinary and spans areas such as machine learning, graph
mining, information retrieval, knowledge-based systems, linguistics,
common-sense reasoning, natural language processing, and Big Data
computing.” Based on the data-type categorization, Sapountzi and
Kostas (2016) provide a particular view of social
Table 4 Challenges of the social media analytics steps based on the social data aspects

| Social data aspect | Challenges | Social media analytics steps |
|---|---|---|
| Volume | Provide more space to store data (Chen and Zhang 2014; Jagadish et al. 2014); volume is increasing faster than CPU speed (Chen and Zhang 2014); ensure data scalability (Chen and Zhang 2014) | Data storage; data processing; data visualization |
| Variety | Homogenize the data for the analysis (Jagadish et al. 2014); provide analytics methods for the multimedia data; store non-schema data | Data storage; data preprocessing; data analysis |
| Velocity | Ensure the data availability; real-time response (Chen and Zhang 2014); real-time visualization | Data capture; data storage; data processing; data visualization |
| Veracity | Noisy data due to the user-generated data; spread of rumors (Ashwin et al. 2016; Mendoza et al. 2010) | Data preprocessing; data processing |
Social Network Analysis and Mining (2018) 8:30
networking data analysis. They combine social network analysis
techniques, which address tasks relevant to structured data such as
link prediction, community detection, and influence analysis, with
Big Data analytics such as text and multimedia mining for analyzing
social networking data.
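To make the link prediction task concrete, the following minimal sketch scores unconnected user pairs by their number of common neighbors. This heuristic is chosen here only for illustration and is not attributed to Sapountzi and Kostas (2016); the follower graph and the function name `common_neighbor_scores` are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(edges):
    """Score every non-adjacent node pair by its number of shared neighbors."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already linked, nothing to predict
        shared = len(neighbors[u] & neighbors[v])
        if shared:
            scores[(u, v)] = shared
    return scores


# Hypothetical friendship graph, treated as undirected for simplicity.
edges = [("ana", "bob"), ("ana", "cy"), ("bob", "dee"), ("cy", "dee"), ("dee", "eli")]
scores = common_neighbor_scores(edges)

# Candidate links, most shared neighbors first.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Pairs with more shared neighbors are ranked as more likely future links; on a real social graph the same computation would typically be distributed over a graph processing framework.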
The reviewed papers present the challenges and the solu-
tions related to the Big Data dimensions. However, their
studies are limited to the description of the use of Big Data
technologies for resolving specific use cases. Thus, in the
next section, we combine the previous results in order to
draw a global view describing the alliance of Big Data and
social media analytics.
4 Discussion
The papers investigated in this survey reveal two main points. First,
there are no predefined steps for analyzing social media data.
Second, the Big Data aspect of the analyzed data is a challenge that
is addressed through the integration of Big Data technologies. For
the first point, we propose the use of the Big Data pipeline to
encapsulate the social media analytics steps. For the second, each of
the previous studies proposes a way to integrate adequate Big Data
technologies and methods, but only for a target use case. Linking
these studies helps build a global view of the social media analytics
process under a Big Data environment, organized into the following
steps.
Data collection and acquisition: a step that refers to gathering data
from different sources and transmitting them to the data storage
platform. Social networking websites (e.g., Facebook), microblogging
websites (e.g., Twitter), and multimedia sharing websites (e.g.,
YouTube) provide application programming interfaces (APIs) for
extracting data, such as the Facebook Graph API (footnote 14), the
Twitter Search API (footnote 15), the YouTube Data API (footnote 16),
and the Google+ REST API (footnote 17). Stieglitz and Dang-Xuan
(2013) advance several approaches for tracking data from social
media: the self-involved approach, the keyword/topic-based approach,
the actor-based approach, and the random/exploratory approach. At
this level, the data are collected in different formats, namely
structured and unstructured data.
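As an illustration of two of these tracking approaches, the sketch below filters an in-memory list of posts by keyword and by actor. It deliberately avoids calling any real social media API; the record layout (`author`, `text`) and both function names are assumptions made for this example.

```python
def keyword_based(posts, keywords):
    """Keyword/topic-based tracking: keep posts mentioning a tracked keyword."""
    wanted = {k.lower() for k in keywords}
    return [p for p in posts if wanted & set(p["text"].lower().split())]


def actor_based(posts, actors):
    """Actor-based tracking: keep posts written by one of the tracked accounts."""
    wanted = set(actors)
    return [p for p in posts if p["author"] in wanted]


# Hypothetical posts standing in for records returned by a platform API.
posts = [
    {"author": "brand_hq", "text": "Our new phone ships today"},
    {"author": "user42", "text": "Loving the new phone camera"},
    {"author": "user77", "text": "Traffic is terrible this morning"},
]

by_keyword = keyword_based(posts, ["phone"])
by_actor = actor_based(posts, ["brand_hq"])
```

In practice the same predicates would usually be pushed to the platform API as query parameters, so that only matching posts are transmitted to the storage platform.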
Data recording or storing: refers to the systems applied to store the
collected data while considering the challenges related to the
volume, privacy, and scalability of data. This step is based on
techniques such as clustering, replication, and indexing (Siddiqa
et al. 2016). Traditional data storage (relational databases) remains
a single-node system with a fixed schema, which makes the storage of
Big Data a challenging task. Hence, Big Data offers data storage
technologies based on the sharing, clustering, replication, and
indexing principles (Corbellini et al. 2017; Siddiqa et al. 2016). In
this regard, storage systems include file system-based technologies
such as the Google File System (GFS) (Ghemawat et al. 2003) and the
Hadoop Distributed File System (HDFS) (Orgaz
Fig. 9 Illustration of Social Big Data
[Figure: social data (volume, velocity, variety, etc.) are fed to Big Data analytics (machine learning, image mining, audio analytics, video analytics, text analytics, graph mining) to produce extracted knowledge; Social Big Data Analytics is shown as the combination of Social Data and Big Data Analytics.]
14 https://developers.facebook.com/docs/graph-api.
15 https://dev.twitter.com/overview/api.
16 https://developers.google.com/youtube/v3/getting-started.
17 https://developers.google.com/+/web/api/rest/.
etal. 2016), and databases technologies as NoSQL data-
bases (Corbellini etal. 2017).
NoSQL databases are provided to store distributed
data and allow a horizontal scaling opportunity.
These databases involve four categories:
key value (Corbellini etal. 2017), for example Redis
(Carlson 2013);
Document-oriented database (Corbellini et al.
2017), for example MangoDB (Chodorow 2013) and
CoucheDB (Lennon 2009);
– wide column (Corbellini et al. 2017), for example
HBase (Taylor 2010) and BigTable (Chang etal. 2008);
graph databases (Corbellini etal. 2017), for example
Neo4j18 and AllegroGraph (Aasman 2006).
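The difference between the four categories can be illustrated by laying out the same tweet under each data model. The sketch below uses plain Python structures rather than any real database client, so it shows only the shape of the data; the tweet and all field names are hypothetical.

```python
import json

# One hypothetical tweet, reused for every data model.
tweet = {"id": "t1", "user": "user42", "text": "Loving the new phone", "lang": "en"}

# Key-value: an opaque value retrieved through a single key.
key_value = {f"tweet:{tweet['id']}": json.dumps(tweet)}

# Document: the value is a structured document that can be filtered by field.
documents = {tweet["id"]: tweet}
english = [doc for doc in documents.values() if doc["lang"] == "en"]

# Wide column: row key -> column family -> column -> value.
wide_column = {
    tweet["id"]: {
        "content": {"text": tweet["text"], "lang": tweet["lang"]},
        "meta": {"user": tweet["user"]},
    }
}

# Graph: nodes plus labelled edges between them.
nodes = {"user42": {"type": "user"}, "t1": {"type": "tweet"}}
edges = [("user42", "POSTED", "t1")]
posted_by_user42 = [dst for src, label, dst in edges
                    if src == "user42" and label == "POSTED"]
```

The key-value layout only supports lookup by key, whereas the document and wide-column layouts expose individual fields and the graph layout makes relationships directly traversable.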
Data preprocessing: refers to the methods applied to prepare data in
a format fit for analysis, using a variety of techniques: data
cleansing (Kumar and Chadha 2012), which handles missing values,
noisy data, and data inconsistency; data transmission (Siddiqa et al.
2016); data reduction (Santhanam and Padmavathi 2014), which relies
on techniques such as data compression and deduplication; data
integration (Ahamed et al. 2014; Esposito et al. 2015), which
combines data from multiple sources; and data transformation (Baskar
et al. 2013), which covers data normalization, aggregation, and
generalization. Certain challenges are associated with this step,
namely data heterogeneity, noise, incompleteness, etc.
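A minimal sketch of three of these techniques on tweet-like records follows: cleansing (missing values, noise), reduction by removing exact duplicates, and transformation by min-max normalization. The record fields and function names are hypothetical, and stripping URLs is only one assumed notion of noise.

```python
import re


def cleanse(records):
    """Drop records without text, strip URLs, and remove exact duplicates."""
    seen, cleaned = set(), []
    for record in records:
        text = (record.get("text") or "").strip()
        if not text:
            continue  # missing value
        text = re.sub(r"https?://\S+", "", text)          # noise: URLs
        text = re.sub(r"\s+", " ", text).strip().lower()  # consistent form
        if text in seen:
            continue  # duplicate record (data reduction)
        seen.add(text)
        cleaned.append({**record, "text": text})
    return cleaned


def min_max(values):
    """Min-max normalization to [0, 1], one common data transformation."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


records = [
    {"text": "Great phone! http://t.co/x", "likes": 10},
    {"text": "great phone!", "likes": 3},
    {"text": None, "likes": 1},
    {"text": "  Battery   is weak ", "likes": 0},
]
cleaned = cleanse(records)
normalized_likes = min_max([r["likes"] for r in cleaned])
```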
Data processing: processing Big Data rests on parallel and
distributed programming paradigms. In this respect, four main
processing paradigms can be distinguished: batch processing,
streaming processing, interactive processing (Chen and Zhang 2014),
and large-scale graph processing as introduced by Sakr (2016).
– Batch processing relies on the MapReduce paradigm (Dean and
Ghemawat 2008), whereby the data are first stored and then processed.
The most popular framework implementing this paradigm is Hadoop
(White 2012), as well as Dryad (Isard et al. 2007).
– Streaming processing is used to process streaming data and obtain
real-time responses (Baquero et al. 2016), with frameworks such as
Storm, S4 (Neumeyer et al. 2010), and Apache Kafka (Auradkar et al.
2012).
– Interactive processing, as defined by Chen and Zhang (2014), is a
framework that “presents the data in an interactive environment,
allowing users to undertake their own analysis of information”, as is
the case of Google’s Dremel (Mikolov et al. 2011).
– Large-scale graph processing is used to process large-scale graphs,
for example with Pregel (Malewicz et al. 2010). It is worth noting
that processing frameworks for large-scale RDF graphs need to be
clearly distinguished from the others (Sakr 2016).
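The MapReduce paradigm behind batch processing can be sketched on a single machine with the classic word count. This is an in-memory illustration of the map, shuffle, and reduce phases, not Hadoop code; the function names are hypothetical, and a real framework would run each phase in parallel across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one stored document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["big data big insight", "social data"])
```

Because every map call depends on one document only and every reduce call on one key only, both phases can be distributed, which is what makes the paradigm suitable for data that are stored first and processed afterwards.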
Data analysis: describes Big Data techniques such as machine
learning, text analytics, and multimedia analytics, used to obtain
insights from the relevant data. It is provided through a variety of
Big Data tools and libraries that implement Big Data analytics
(Sapountzi and Kostas 2016), such as multimedia analysis and text
mining, and that draw on several disciplines (Chen and Zhang 2014),
such as data mining, machine learning, and social network analysis.
In this regard, Big Data offers a variety of tools that operate on
top of the Big Data processing frameworks. Worth citing in this
respect are Apache Mahout (Owen and Owen 2012), Skytree server
(footnote 19), and Spark MLlib (Meng et al. 2016) for machine
learning, and Apache Nutch (Orgaz et al. 2016) in the business data
analysis context.
Data interpretation: refers to the visualization of reports,
diagrams, and tables in a format recognizable by end users; it still
faces challenges related to data complexity. It relies on Big Data
visualization tools to turn the analysis results into understandable
insights for end users. Visualization tools offer a clear view of the
interpreted data and allow users to interact with the extracted
insights through reports and graphs. Several Big Data tools are
applicable in this regard, such as Pentaho (footnote 20) for reports,
Tableau (footnote 21) for data visualization, the Jaspersoft package
(footnote 22) for generating business intelligence reports, and
Talend Open Studio (footnote 23) for graphic visualization.
Figures10 and 11 detail each related step through highlight-
ing the Big Data technologies involved along with the relevant
methods and techniques applicable to each step. Hence, Fig.10
deals with the steps related to the Social Big Data management,
while Fig.11 illustrates the analysis and interpretation steps
relevant to the Social Big Data analytics process.
To better explain the usefulness of the proposed frame-
work, first, we establish an analogy of some existing frame-
works from the literature and the proposed framework archi-
tecture. Second, we picked a social media analytics task
18 https://neo4j.com/.
19 http://www.skytree.net/.
20 https://www.pentaho.com/.
21 https://www.tableau.com/.
22 https://github.com/Jaspersoft/jasperreports.
23 https://www.talend.com/products/talend-open-studio/.
Fig. 10 Social Big Data management steps
[Figure: pipeline of five stages, from data sources to data processing.
Data sources: social media data sources (media sharing, microblogging, social networking).
Data collection: tracking approaches (Stieglitz and Dang-Xuan 2013): self-involved, keyword-based, actor-based, random/explorative, URL-based; tracking methods (Stieglitz and Dang-Xuan 2013): APIs (Facebook Graph API, Twitter Search API, YouTube Data API, Google+ REST API) and RSS/HTML parsing (blogs); output data format: structured, semi-structured, and unstructured data.
Data storage: storage techniques (Siddiqa et al. 2016): clustering, replication, indexing; storage systems: file systems (GFS, HDFS, COSMOS, TFS) and NoSQL databases.
Data preprocessing: preprocessing algorithms: data cleansing (Kumar and Chadha 2012): missing values, noisy data, inconsistent data; data transformation (Baskar et al. 2013): normalization, aggregation, generalization; data reduction (Santhanam and Padmavathi 2014): compression, deduplication.
Data processing: programming models and tools: batch processing (Hadoop); stream processing (S4, Apache Kafka, Storm, etc.); interactive processing (Google’s Dremel); large-scale graph processing (Pregel, Apache Giraph, etc.).]
Fig. 11 Social Big Data analysis steps
[Figure: two stages, data analysis and data interpretation.
Data analysis: analysis type: content-based analysis, structure-based analysis; analysis objective: topic/issue/trend analysis, opinion/sentiment analysis, social network analysis, statistical analysis; analysis method: text mining, video content analysis, image analysis, trend detection, opinion mining, sentiment analysis, link prediction, community detection, influence analysis, cluster analysis, linear regression; Big Data analysis tools: machine learning and data mining, search engine, statistical analysis.
Data interpretation: analysis output: dashboard, report, graph; Big Data tools: Talend, Pentaho, Tableau, QlikView, Gephi, Walrus.]
that consists in analyzing the sentiment of Twitter users toward a
new brand product. Then, we describe how this task is processed
following the predefined steps of the proposed framework, in order to
show its utility.
For the analogy between existing frameworks and the proposed
architecture, Table 5 introduces a set of applications that implement
the conceptual view of the Social Big Data analytics framework. The
table presents simple examples of applications by listing the
techniques applied at each step of the Social Big Data analytics
pipeline. In this respect, Selvan and Moh (2015) stress the
importance of real-time customer feedback for companies by analyzing
tweets extracted via the Twitter streaming API and processed using
the Hadoop framework. The data are stored prior to the analysis in
the HDFS component of the Hadoop framework. In the same context, and
for the purpose of investigating public opinion on a particular topic
of interest, Bhuta et al. (2014) analyze Twitter data using sentiment
analysis techniques, after collecting a set of public tweets through
the Twitter streaming API and filtering out the non-English words.
The analysis results are reported using statistical graphs and
geographical charts. Table 5 also introduces a framework developed by
Bohlouli et al. (2015), which gives a concrete representation of the
proposed conceptual Social Big Data analytics framework. The data
were collected using the Twitter Streaming API, and the Flume
component of Hadoop transfers the collected data to the HDFS for the
storage step. Bohlouli et al. (2015) then use the Hadoop framework to
process the collected data and Mahout for analysis purposes, and
visualize the results through a user interface involving reports,
diagrams, and curves.
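Dictionary-based classification, the processing technique listed for Bhuta et al. (2014) in Table 5, can be sketched as follows. The lexicon, its weights, and the function names are hypothetical and far smaller than any lexicon used in practice; the sketch does not reproduce the authors' actual implementation.

```python
import re

# Hypothetical sentiment lexicon: word -> polarity weight.
LEXICON = {"love": 1, "great": 1, "good": 1, "bad": -1, "terrible": -1, "hate": -1}


def sentiment_score(text):
    """Sum the polarity of every lexicon word found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)


def classify(text):
    """Map the score to a label; words outside the lexicon count as neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = [classify(t) for t in (
    "I love this great phone",
    "Terrible battery, bad screen",
    "It arrived on Monday",
)]
```

Each tweet is scored independently of the others, which is why this kind of classification fits both the real-time setting and a batch job over stored tweets.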
The variety of Big Data technologies makes identifying the adequate
technology a confusing task. In this respect, the current study
highlights the most popular Big Data technologies suitable for the
storage and processing steps and summarizes their major features. The
state of the art reveals several families of Big Data storage
technologies (Strohbach et al. 2016), namely distributed file systems
such as the Hadoop Distributed File System (HDFS), NoSQL databases,
NewSQL databases, and Big Data querying platforms. The current work
identifies the features of each data model and gives examples of
technologies for each one. Several features are derived from the
state of the art (Hashem et al. 2015; Siddiqa et al. 2017; Wu et al.
2017; Corbellini et al. 2017), namely Consistency, Availability, and
Partition tolerance (CAP) (Han et al. 2011b), of which a technology
can guarantee at most two at once, along with the architecture and
the data storage characterizing the technology. Table 6 depicts these
features along with the advantages and disadvantages of each data
model. For the processing step, the state of the art (Yaqoob et al.
2016; Belcastro et al. 2018; Hu et al. 2014; Cao et al. 2017) reveals
a set of features that serve to compare the technologies of the
different Big Data processing paradigms. Among those features, we
cite: scalability, the ability of a system, network, or process to
handle a growing amount of work either by adding new resources to a
single node or by adding new nodes to the system (Cao et al. 2017);
fault tolerance, the continuity of system operation despite the
failure of a node (Cao et al. 2017); latency, the speed of the system
response, usually used for comparing real-time technologies
(Chintapalli et al. 2016); and the programming model implemented by
the technology (i.e., MapReduce, Directed Acyclic Graph (DAG),
Message Passing, Bulk Synchronous Parallel (BSP), Workflow, and
SQL-like) (Belcastro et al. 2018). Table 7 details the features along
with the applications of the most popular Big Data processing
technologies for each processing paradigm (i.e., batch, stream,
interactive, and graph processing).
In order to reach the target objectives of the analysis, the
researcher should set a clear plan prior to proceeding with
Table 5 Social Big Data analytics framework applications (steps of the Social Big Data conceptual framework)

| References | Data sources | Data collection | Data storage | Data preprocessing | Data processing | Data analysis | Data interpretation |
|---|---|---|---|---|---|---|---|
| (Selvan and Moh 2015) | Twitter | Streaming API and Flume component of Hadoop | HDFS of Hadoop | Filtration done based on keyword | Hive of Hadoop, MapReduce paradigm, and Apache Oozie | Text analysis | Visualization in Microsoft Office Excel |
| (Bhuta et al. 2014) | Twitter | Twitter Streaming API | No storage as it is a real-time processing | Filter non-English tweets | Dictionary-based classification | Sentiment analysis | Statistical graphs, geographical charts |
| (Bohlouli et al. 2015) | Twitter | Social media API and Flume | HBase | Filter the data and reduce noise | Implementation of the Hadoop MapReduce | Mahout for decision making and sentiment analysis | HTML-based user interface |
Table 6 Features-related Big Data storage technologies

| Feature | Redis | MemcacheDB | Cassandra | HBase | Hypertable | MongoDB | CouchDB | Neo4j | AllegroGraph (RDF graph) |
|---|---|---|---|---|---|---|---|---|---|
| Data model | Key-value | Key-value | Column-oriented | Column-oriented | Column-oriented | Document | Document | Graph | Graph |
| Scalability | Supported (high) | Supported | Supported | Supported | Supported | Supported | Supported | Supported | Supported |
| CAP (C: consistency, A: availability, P: partition) | CP | CP | AP | CP | CP | CP | AP | CP | CP |
| Data storage | In-memory | In-memory | In-memory | On-disk | On-disk and in-memory | On-disk | On-disk | On-disk and in-memory | On-disk |
| Query language | API | API | MapReduce | MapReduce | Thrift interface | SQL | MapReduce | Cypher, SPARQL | SPARQL, RDFS++ |

Architecture (eight values are given for the nine technologies, listed here in source order): multi data nodes; master/slave; multi-master; master/slave; master/slave; master/slave; master/slave; master/slave.

| | Key-value data model | Column-oriented data model | Document data model | Graph data model |
|---|---|---|---|---|
| Description | Data are stored as a distributed hash table | Data tables are stored as sections of columns of data. It is an extension of the key-value store database, where columns can have a complex structure, rather than a blob value (Storey and Song 2017; Bermbach et al. 2015) | A collection of key-value stores where the value is a document, such as JSON, BSON (Abramova and Bernardino 2013) | Based upon graph theory (set of nodes, edges, and properties) (Storey and Song 2017) |
| Advantages | Very fast random access via key, scalable, easy to distribute across clusters, and provides a simple model as a hash table (Storey and Song 2017). Several popular websites use the key-value store data model, namely: “Dynamo at Amazon, Redis at GitHub, Digg, and Blizzard Interactive, Memcached at Facebook, Zynga and Twitter and Voldemort at Linkedin” (Atikoglu et al. 2012) | Easy distribution, management of very large volumes of data, and partial update | Good for semistructured data, easy partition, the number of requests for composite objects is limited, permission of ad hoc applications, and partial update | Represent many large real-world entities such as maps and social networks (e.g., OpenStreetMap, Twitter) (Corbellini et al. 2017). Represent linked open data (RDF graph) and offer a full query language (SPARQL) |
| Disadvantages | No complex filtering query, the join needs to be performed in the applications, and there is no mechanism for supporting multirecord consistency (Corbellini et al. 2017; Storey and Song 2017) | All joins must be made in the code, no constraints and no triggers (Corbellini et al. 2017) | All joins must be made in the code, no constraints and no triggers (Corbellini et al. 2017) | Not efficient at processing high volumes of transactions |
Table 7 Features-related Big Data processing technologies

| Paradigm | Technology | Programming model | Programming language | Scalability | Fault tolerance | Latency |
|---|---|---|---|---|---|---|
| Batch processing | Hadoop | MapReduce | Java | Supported | Supported on node level | MapReduce reads and writes from disk, which slows down the processing speed |
| Batch processing | Dryadᵃ | Directed Acyclic Graph | C++ | Supported | Supported on node level | Not supported (processes terabytes of data at scale of minutes) (Cao et al. 2017) |
| Stream processing | S4 | MapReduce | Scala | Supported | Supported | Lower latency due to the use of local node memory (Cao et al. 2017) |
| Stream processing | Spark Streaming | Directed Acyclic Graph | Scala | Supported | Supported | Micro-batch: runs applications up to 100× faster in memory and 10× faster on disk than Hadoop (Chintapalli et al. 2016; Xin et al. 2012) |
| Stream processing | Apache Storm | Directed Acyclic Graph | Clojureᵇ | Supported | Supported by Nimbus, the master node: in case of failure, the passive node becomes active without affecting the workers | Lower latency response (Chintapalli et al. 2016; Belcastro et al. 2018) |
| Graph processing | Pregel | Bulk Synchronous Parallel | C++ | Supported (Cao et al. 2017) | Fault tolerance by checkpointing | Not supported |
| Graph processing | GraphLab | MapReduce | C++ | Supported (Cao et al. 2017) | Supported by snapshot update | Not supported |

Applications:
– Hadoop: useful for “distributed sorting, Web link-graph reversal, Web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation” (Dean and Ghemawat 2010; Cao et al. 2017).
– Dryad: used by Microsoft to analyze petabytes of data on clusters of thousands of computers (Cao et al. 2017).
– S4: a general purpose framework, used by Yahoo, Google, and Bing for processing unlimited streams of data (Neumeyer et al. 2010); a stream processing engine for applications based on data mining and machine learning (Neumeyer et al. 2010).
– Spark Streaming: stores data in RAM, which makes it faster than Hadoop at processing iterative machine learning (Belcastro et al. 2018; Xin et al. 2012).
– Apache Storm: applications related to the processing of social network data and sensor networks (Belcastro et al. 2018).
– Graph processing (Pregel, GraphLab): graph computing such as PageRank (Malewicz et al. 2010), shortest path, and bipartite matching; social network analysis, as in the case of Facebook, in which graph processing is used to analyze the social graph formed by users and their connections (Hu et al. 2014); Netflix movie recommendation (Low et al. 2012); friends-of-friends score application (Ching et al. 2015).

a https://www.microsoft.com/en-us/research/project/dryad/
b https://clojure.org/
the analysis process. Accordingly, one can refer to the proposed
framework, which identifies the steps to follow while accounting for
the challenges and the solutions available at each step, more
specifically:
1. Identifying whether the case actually presents a Big Data problem,
by examining the characteristics of the collected data. In the
selected case study, the analysis subject is a selection of tweets
extracted from Twitter, for which the following dimensions are
considered: the volume (to be quantified), the variety (tweets:
unstructured data), the velocity (real-time data), and the veracity.
Together, these conditions constitute a Big Data challenge.
2. Observing each step of the proposed framework by identifying the
goals, challenges, and requirements likely to arise, along with the
possible solutions.
Data collection
Goal: collect real-time data about the target subject.
Solutions:
- Method: search by keyword.
- Tools: Twitter streaming API.
Data storage
Goal: store a huge amount of unstructured data with easy access to the collected data.
Challenges: volume and variety.
Solutions:
- Tools: NoSQL database, cloud solution, etc.
Data preprocessing
Goal: extract the relevant tweets and delete the noise (useless metadata).
Challenges: volume, unstructured data.
Solutions: data cleansing algorithms.
Data processing
Goal: parallel analysis of the tweets.
Challenges: volume.
Solutions:
- Tools: Hadoop (batch) / Spark (real-time) and a sentiment analysis tool.
Data analysis
Goal: generate scores relative to user sentiment.
Requirement: identify whether it is a real-time or a batch processing.
Challenges: volume and variety.
Solutions:
- Identify the adequate sentiment analysis method.
- Tools: sentiment analysis tool (e.g., R language).
Data interpretation
Goal: visualize the individual sentiment score values and interpret all the scores to identify the popularity of the brand.
Challenges: volume.
Solution: the use of a reporting tool supporting the data volume.
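The plan above can also be written down as data, so that the single requirement it names (real-time or batch processing) selects the processing tool. The following is a minimal sketch under that assumption; the `Step` structure and the functions `build_plan` and `steps_facing` are hypothetical names introduced for illustration, and the field values simply restate the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    goal: str
    challenges: tuple
    solutions: tuple


def build_plan(real_time):
    """Return the six framework steps for the Twitter sentiment case study."""
    # Requirement named in the plan: real-time versus batch processing.
    engine = "Spark (real-time)" if real_time else "Hadoop (batch)"
    return [
        Step("Data collection", "collect real-time data about the target subject",
             (), ("search by keyword", "Twitter streaming API")),
        Step("Data storage", "store a huge amount of unstructured data",
             ("volume", "variety"), ("NoSQL database", "cloud solution")),
        Step("Data preprocessing", "extract the relevant tweets and delete the noise",
             ("volume", "unstructured data"), ("data cleansing algorithms",)),
        Step("Data processing", "parallel analysis of the tweets",
             ("volume",), (engine, "sentiment analysis tool")),
        Step("Data analysis", "generate scores relative to user sentiment",
             ("volume", "variety"), ("sentiment analysis method", "R language")),
        Step("Data interpretation", "visualize and interpret the sentiment scores",
             ("volume",), ("reporting tool supporting the data volume",)),
    ]


def steps_facing(plan, challenge):
    """List the steps that must address a given Big Data challenge."""
    return [step.name for step in plan if challenge in step.challenges]


streaming_plan = build_plan(real_time=True)
variety_steps = steps_facing(streaming_plan, "variety")
```

Keeping the plan as data makes it easy to check, before any tool is installed, which steps are exposed to a given challenge.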
The present survey targets researchers who intend to analyze social
media data, for any purpose, without having a clear idea of the steps
needed to carry out the analysis, of the challenges likely to be
encountered throughout it, or of the Big Data technologies adequate
for it. Accordingly, from the present article, researchers should be
able to learn about:
– The steps to follow for a social media analytics task to be
conducted effectively.
– How to ensure that they are really dealing with a Big Data problem
during the social media analysis. To this end, they can turn to the
section on the Big Data dimensions characterizing social media data,
which shows how each dimension maps to social data and makes clear
that volume is not the only characteristic of a Big Data problem.
– The challenges associated with each step, identified before they
arise. Concerning data collection, for instance, they have to know
how much data they will need to collect and, therefore, how much
storage space will be required. They also need to identify whether
the application pertains to real-time analysis or to batch analysis,
so that the adequate technologies can be identified, along with the
algorithms fit for real-time storage, analysis, and response.
Additionally, they have to identify the adequate methods for
analyzing multimedia data when dealing with unstructured data.
– The Big Data tools most commonly appropriate for each step. The
present survey describes several frameworks, tools, and algorithms
commonly applicable throughout the analysis process. For instance,
for the storage step, users can apply NoSQL databases, file systems,
or cloud storage; for the processing step, the relevant frameworks
are categorized according to their use (batch, stream, interactive,
and graph processing); and for the analysis step, the tools are
categorized according to their functionalities (searching, mining,
statistical analysis, etc.).
5 Conclusion andpotential research trends
The present work studies the interaction between social media
analytics and Big Data. Based on the results, the paper establishes
an alliance between social media analytics and Big Data at two major
levels, namely the processing steps of social media analytics and the
technologies applied in social media analytics.
Concerning the processing steps, the state of the art shows that each
social media analytics framework follows its own steps throughout the
social media data processing procedure. Still, no clear view seems
available as to how such social data can be processed. Hence, the Big
Data processing pipeline is put forward to encapsulate the processing
procedures associated with social media analytics. This encapsulation
is established through an analogy between the Big Data dimensions and
the features of social media data, which shows that social media data
are in fact a Big Data source exhibiting all the discussed Big Data
dimensions, specifically volume, variety, velocity, and veracity. The
processing steps are the following: a data collection step, in which
data are gathered through different methods such as social media
APIs; a data storage step, which requires a platform able to hold a
huge amount of continuously generated unstructured data; a data
preprocessing step, which applies specific algorithms to clean the
data and prepare them for the processing step; and, finally, an
analysis and interpretation step, which summarizes and visualizes the
useful insights extracted from social media data so that they are
understandable and easy to recognize by end users.
The second level of the alliance rests on integrating the Big Data
technologies relevant to each step of the social media analytics
process. Introduced under the Social Big Data research field, this
combination identifies, for each step of the social media analytics
process, the appropriate frameworks that help optimize the analysis
results and support the smooth flow of social data. Indeed, Big
Data-related technologies support the processing of huge amounts of
unstructured data through the Hadoop framework, and the real-time
processing of continuously generated data through the Storm
framework, given the need for effective systems able to deal with
such a speedy data flow. For each of these frameworks, a set of
libraries has been developed, for example, Mahout, MLlib, Solr, and
GraphX, to support analyses such as machine learning, search, and
graph computation. For visualization purposes, Big Data also offers
technologies whereby the analysis results can be visualized on
dashboards (e.g., QlikView), in tables (e.g., Tableau), or through
huge graphs, as is the case with Gephi.
Figures 10 and 11 depict this alliance through a conceptual framework
showing how social media analytics can be established under a Big
Data environment.
Overall, this study may serve as an initial guide for those
who intend to analyze social media data, giving them a clear
view of the processing steps involved in social media
analytics and of the Big Data technologies applicable to each
step of the pipeline. The main result of this study is a
conceptual framework that associates the Big Data pipeline
with the social data analysis process. Still, further
studies, some of which are currently underway in our
laboratories, are needed to explore the feasibility and
potential performance of the proposed scheme, in particular
the merging of the collection processes across several
social networks while accounting for data heterogeneity.
It is also important to understand the processing steps
involved in Social Big Data analytics and to identify the
challenges related to each step, while evaluating the Big
Data technologies developed to manage such a huge amount of
social data. What really matters, however, is choosing “the
right tools for the right job” so that the desired analytics
goals, namely understanding and prediction, are achieved
through Big Data processing. To help practitioners benefit
from Big Data technologies, this work presents the different
technologies used during the storage and processing steps.
In addition, it reviews the features of the storage
technologies (i.e., scalability, CAP, architecture, data
storage, and query language) and of the processing
technologies (i.e., open source, scalability, latency, API,
fault tolerance, and programming language), along with their
advantages, disadvantages, and applications.
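As a minimal sketch of how such a feature-based review can support choosing “the right tools for the right job”, the following Python fragment records two of the storage features listed above (data storage model and query language) for a few well-known systems and filters them against a requirement. The class `StorageOption`, the `CATALOG` list, and the `shortlist` function are hypothetical names introduced for illustration; the catalog is deliberately tiny and is not the comparison table of this study.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StorageOption:
    name: str
    data_model: str      # the "data storage" feature
    query_language: str  # the "query language" feature


# Illustrative entries only; a real comparison would cover more
# systems and more features (scalability, CAP, architecture).
CATALOG: List[StorageOption] = [
    StorageOption("MongoDB", "document", "MongoDB query language"),
    StorageOption("Cassandra", "wide-column", "CQL"),
    StorageOption("Neo4j", "graph", "Cypher"),
    StorageOption("Redis", "key-value", "Redis commands"),
]


def shortlist(catalog: List[StorageOption],
              requirements: Dict[str, str]) -> List[StorageOption]:
    """Keep the options whose features match every stated requirement."""
    return [
        option for option in catalog
        if all(getattr(option, feature) == value
               for feature, value in requirements.items())
    ]


graph_candidates = shortlist(CATALOG, {"data_model": "graph"})
```

Expressing the features as explicit fields makes the selection criteria visible and repeatable, which is the practical benefit of a structured comparison over an ad hoc choice.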
Constructing an effective analysis strategy is the first
step in extracting insights from data. For such a strategy
to succeed, a set of requirements should be fulfilled:
identifying the objectives behind the data analytics
procedure; talking to the stakeholders, i.e., the people on
the business and technical sides likely to be affected by
the analysis results, in order to understand their specific
needs; identifying the data useful for achieving the
analytics goals; and identifying the techniques and tools
required for processing the data, as each tool has features
suited to a specific need.
This work deals exclusively with the Big Data frameworks
treated in a selection of relevant works. However, the range
of Big Data technologies is wider than those covered in this
context. As a future line of work, researchers could
investigate the other existing Big Data technologies and
establish systematic comparisons between them. For instance,
they could consider several Big Data technologies able to
achieve the same task but with different performance levels.
Future studies should therefore specify the parameters
needed to compare the existing Big Data technologies. Taking
Big Data storage solutions as an example, people lacking
experience might well be unsure whether to use a document
database or a graph database. Answering such a question
often requires extensive experience in the database domain,
which makes such a comparison extremely useful for the user.
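The document-versus-graph question can be made concrete with a small Python sketch that models the same toy social data in both styles. It uses plain dictionaries and lists rather than any database API, and all names (`user_documents`, `follow_edges`, `reachable_within`) and data values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Document style: each user is one self-contained record, which suits
# "fetch everything about this user" lookups.
user_documents: Dict[str, dict] = {
    "alice": {"name": "Alice", "posts": [{"text": "hello", "likes": 3}]},
    "bob": {"name": "Bob", "posts": []},
    "carol": {"name": "Carol", "posts": [{"text": "hi all", "likes": 1}]},
}

# Graph style: relationships are first-class edges, which suits
# traversal questions such as "who is within two hops of this user?".
follow_edges: List[Tuple[str, str]] = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
]


def reachable_within(edges: List[Tuple[str, str]], start: str,
                     max_hops: int) -> Set[str]:
    """Return the users reachable from `start` in at most `max_hops` follows."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
    seen = {start}
    frontier = {start}
    for _ in range(max_hops):
        # Expand one hop and drop users that were already visited.
        frontier = {n for user in frontier for n in adjacency[user]} - seen
        seen |= frontier
    return seen - {start}


alice_profile = user_documents["alice"]
alice_two_hops = reachable_within(follow_edges, "alice", 2)
```

When the dominant queries look like the profile lookup, a document model is a natural fit; when they look like the multi-hop traversal, a graph model tends to be the more convenient choice. The workload, rather than the data alone, drives the decision.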
It is also worth noting that the present research draws a
global view of the social media analytics process, which
could be extended to focus on a specific input of the
analysis, for instance by constructing a social media
analytics framework whereby knowledge could be extracted
from media data (e.g., photographs or videos). The extended
framework would be more specific in reviewing the techniques
and algorithms used for analyzing image or video data, along
with the technologies relevant for such purposes.
References
Aasman J (2006) AllegroGraph: RDF triple database. Franz
Incorporated, Oakland
Abbasi A, Adjeroh DA, Dredze M, Paul MJ, Zahedi FM, Zhao H, Walia
N etal (2014) Social media analytics for smart health. IEEE
Intell Syst 29(2):60–80
Abramova V, Bernardino J (2013) NoSQL databases: MongoDB vs
cassandra. In: Proceedings of the international C* conference
on computer science and software engineering, ACM, pp 14–22
Achrekar H, Gandhe A, Lazarus R, Yu S-H, Liu B (2011) Predict-
ing flu trends using twitter data. In: Computer Communications
Workshops (INFOCOM WKSHPS), 2011 IEEE Conference on.
IEEE, pp 702–707
Ackoff RL (1989) From data to wisdom. J Appl Syst Anal 16(1):3–9
Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin
M, Gehrke J, Haas L, Halevy A, Han J, Jagadish HV, Labrinidis
A, Madden S, Papakonstantinou Y, Patel JM, Ramakrishnan R,
Ross K, Shahabi C, Suciu D, Vaithyanathan S, Widom J (2012)
Challenges and opportunities with big data—a community white
paper developed by leading researchers across the United States.
http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
Agrawal R, Kadadi A, Dai X, Andres F (2015) Challenges and oppor-
tunities with big data visualization. In: Proceedings of the 7th
international conference on management of computational and
collective intElligence in digital EcoSystems, ACM, pp 169–173
Ahamed BB, Ramkumar T, Hariharan S (2014) Data integration pro-
gression in large data source using mapping affinity. In: 7th Inter-
national conference on advanced software engineering and its
applications (ASEA), IEEE, pp 16–21
Ashwin KTK, Kammarpally P, George KM (2016) Veracity of infor-
mation in twitter data: a case study. In: IEEE Computer Society
BigComp, pp 129–136
Atikoglu B, Xu Y, Frachtenberg E, Jiang S, Paleczny M (2012) Work-
load analysis of a large-scale key-value store. In: Harrison PG,
Arlitt MF, Casale G (eds) SIGMETRICS. ACM, New York, pp
53–64
Avvenuti M, Cresci S, Marchetti A, Meletti C, Tesconi M (2014) EARS
(earthquake alert and report system): a real time decision support
system for earthquake crisis management. In: Proceedings of
the 20th ACM SIGKDD international conference on knowledge
discovery and data mining, ACM, pp 1749–1758
Avvenuti M, Cresci S, Marchetti A, Meletti C, Tesconi M (2016) Pre-
dictability or early warning: using social media in modern emer-
gency response. IEEE Internet Comput 20(6):4–6
Baquero AV, Palacios RC, Molloy O (2016) Real-time business activ-
ity monitoring and analysis of process performance on big-data
domains. Telematics Inform 33(3):793–807
Baskar S, Arockiam L, Charles S (2013) A systematic approach on data
pre-processing in data mining. Compusoft 2(11):335
Batrinca B, Treleaven PC (2015) Social media analytics: a survey of
techniques, tools and platforms. AI Soc 30:89–116
Belcastro L, Marozzo F, Talia D (2018) Programming models and
systems for Big Data analysis. Int J Parallel Emerg Distrib
Syst. https://doi.org/10.1080/17445760.2017.1422501
Bermbach D, Müller S, Eberhardt J, Tai S (2015) Informed schema
design for column store-based database services. In: SOCA,
IEEE Computer Society, pp 163–172
Bhuta S, Doshi A, Doshi U, Narvekar M (2014) A review of techniques
for sentiment analysis of Twitter data. In: International
conference on issues and challenges in intelligent computing techniques
(ICICT), IEEE, pp 583–591
Bocconi S, Bozzon A, Psyllidis A, Bolivar CT, Houben G-J (2015)
Social glass: a platform for urban analytics and decision-making
through heterogeneous social data. In: Gangemi A, Leonardi S,
Panconesi A (eds) WWW (companion volume). ACM, New
York, pp 175–178
Bohlouli M, Dalter J, Dornhöfer M, Zenkert J, Fathi M (2015) Knowl-
edge discovery from social media using big data-provided senti-
ment analysis (SoMABiT). J Inf Sci 41(6):779–798
Bothos E, Apostolou D, Mentzas G (2010) Using social media to pre-
dict future events with agent-based markets. IEEE Intell Syst
25(6):50–58
Cambria E, Wang H, White B (2014) Guest editorial: big social data
analysis. Knowl-Based Syst 69:1–2
Cao J, Chawla S, Wang Y, Wu H (2017) Programming platforms
for Big Data analysis. In: Handbook of big data technologies.
Springer, pp 65–99
Carlson JL (2013) Redis in action. Manning Publications Co., Shelter
Island
Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M,
Chandra T etal (2008) Bigtable: a distributed storage system
for structured data. ACM Trans Comput Syst (TOCS) 26(2):4
Chang RM, Kauffman RJ, Kwon Y (2014) Understanding the paradigm
shift to computational social science in the presence of big data.
Decis Support Syst 63:67–80
Chen CP, Zhang C-Y (2014) Data-intensive applications, challenges,
techniques and technologies: a survey on Big Data. Inf Sci
275:314–347
Chen M, Ebert D, Hagen H, Laramee RS, Van Liere R, Ma K-L,
Ribarsky W et al (2009) Data, information, and knowledge in
visualization. IEEE Comput Gr Appl 29(1):1–10
Cheng X, Liu J, Dale C (2013) Understanding the characteristics of
internet short video sharing: a YouTube-based measurement
study. IEEE Trans Multimed 15(5):1184–1194
Ching A, Edunov S, Kabiljo M, Logothetis D, Muthukrishnan S (2015)
One Trillion edges: graph processing at Facebook-scale. PVLDB
8:1804–1815
Chintapalli S, Dagit D, Evans B, Farivar R, Graves T, Holderbaugh M,
Liu Z, Nusbaum K, Patil K, Peng B, Poulosky P (2016) Bench-
marking streaming computation engines: storm, flink and spark
streaming. In: IPDPS workshops, IEEE Computer Society, pp
1789–1792
Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media,
Inc., Newton
Corbellini A, Mateos C, Zunino A, Godoy D, Schiaffino S (2017) Per-
sisting big-data: the NoSQL landscape. Inf Syst 63:1–23
Cormode G, Krishnamurthy B (2008) Key differences between Web
1.0 and Web 2.0. First Monday 13(6)
Dang Y, Zhang Y, Hu PJ-H, Brown SA, Ku Y, Wang J-H, Chen H
(2014) An integrated framework for analyzing multilingual con-
tent in Web 2.0 social media. Decis Support Syst 61:126–135
Dean J, Ghemawat S (2008) MapReduce: simplified data processing
on large clusters. Commun ACM 51(1):107–113
Dean J, Ghemawat S (2010) MapReduce: a flexible data processing
tool. Commun ACM 53:72–77
Dredze M (2012) How social media will change public health. IEEE
Intell Syst 27(4):81–84
Elgendy N, Elragal A (2014) Big data analytics: a literature review
paper. In: Perner P (ed) Advances in data mining. Applications
and theoretical aspects. ICDM. Lecture notes in computer
science, vol 8557. Springer, Cham
Esposito C, Ficco M, Palmieri F, Castiglione A (2015) A knowledge-
based platform for Big Data analytics based on publish/subscribe
services and stream processing. Knowl-Based Syst 79:3–17
Fan W, Bifet A (2013) Mining big data: current status, and forecast to
the future. ACM SIGKDD Explor Newsl 14(2):1–5
Furht B, Villanustre F (2016) Introduction to Big Data. Big Data tech-
nologies and applications. Springer, Berlin, pp 3–11
Gandomi A, Haider M (2015) Beyond the hype: big data concepts,
methods, and analytics. Int J Inf Manag 35(2):137–144
Auradkar A, Botev C, Das S, De Maagd D, Feinberg A, Ganti P, Gao
L et al (2012) Data infrastructure at LinkedIn. In: IEEE 28th
international conference on data engineering (ICDE), IEEE, pp
1370–1381
Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system.
ACM SIGOPS operating systems review, vol 37. ACM, New
York, pp 29–43
Han J, Kamber M, Pei J (2011a) Data mining: concepts and techniques.
Elsevier, Amsterdam
Han J, Haihong E, Le G, Du J (2011b) Survey on NoSQL database. In:
6th international conference on pervasive computing and applica-
tions (ICPCA), IEEE, pp 363–366
Haryadi AF, Hulstijn J, Wahyudi A, van der Voort H, Janssen M
(2016) Antecedents of big data quality: an empirical
examination in financial service organizations. In: IEEE international
conference on Big Data (Big Data), IEEE, pp 116–121
Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU
(2015) The rise of “big data” on cloud computing: review and
open research issues. Inf Syst 47:98–115
He W, Wang F-K, Akula V (2017) Managing extracted knowledge
from big social media data for business decision making. J Knowl
Manag 21(2):275–294
Hiba S, Mohamed Ali HT, Mohamed BA (2018) Popularity metrics’
normalization for social media entities. In: 20th International
Conference on Enterprise Information Systems, pp 525–535
Hu H, Wen Y, Chua TS, Li X (2014) Toward scalable systems for big
data analytics: a technology tutorial. IEEE Access 2:652–687
Imran M, Castillo C, Diaz F, Vieweg S (2015) Processing social media
messages in mass emergency: a survey. ACM Comput Surv
47(4):67
Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: dis-
tributed data-parallel programs from sequential building blocks.
ACM SIGOPS operating systems review, ACM, vol 41, pp 59–72
Jagadish H, Gehrke J, Labrinidis A, Papakonstantinou Y, Patel JM,
Ramakrishnan R, Shahabi C (2014) Big data and its technical
challenges. Commun ACM 57(7):86–94
Ji X, Chun SA, Cappellari P, Geller J (2017) Linking and using social
media data for enhancing public health analytics. J Inf Sci
43(2):221–245
Jure L (2011) Social media analytics: tracking, modeling and predict-
ing the flow of information through networks. In: Proceedings of
the 20th international conference companion on World wide web
(WWW ‘11). ACM, New York, NY, USA, pp 277–278
Kaisler SH, Armour F, Espinosa JA, Money WH (2013) Big Data:
issues and challenges moving forward. In: IEEE Computer Soci-
ety HICSS, pp 995–1004
Kanhabua N, Romano S, Stewart A, Nejdl W (2012a) Supporting
temporal analytics for health-related events in microblogs. In:
Proceedings of the 21st ACM international conference on Infor-
mation and knowledge management, CIKM’12, ACM, Maui,
Hawaii, pp 2686–2688
Kaplan AM, Haenlein M (2010) Users of the world, unite! The chal-
lenges and opportunities of Social Media. Bus Horiz 53(1):59–68
Karpenko A, Aarabi P (2011) Tiny videos: a large data set for non-
parametric video retrieval and frame classification. IEEE Trans
Pattern Anal Mach Intell 33(3):618–630
Khan N, Yaqoob I, Hashem IAT, Inayat Z, Mahmoud Ali WK, Alam
M, Shiraz M et al (2014) Big data: survey, technologies,
opportunities, and challenges. Sci World J 2014:1–18
Kotsilieris T, Pavlaki A, Christopoulou SC, Anagnostopoulos I (2017)
The impact of social networks on health care. Social Netw Anal
Min 7(1):18:1–18:6
Kumar V, Chadha A (2012) Mining association rules in student’s
assessment data. Int J Comput Sci Issues 9(5):211–216
Lennon J (2009) Introduction to CouchDB. In: Beginning CouchDB, pp
3–9
Li N, Wu DD (2010) Using text mining and sentiment analysis for
online forums hotspot detection and forecast. Decis Support Syst
48(2):354–368
Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM
(2012) Distributed GraphLab: a framework for machine learning
and data mining in the cloud. Proc VLDB Endow 5(8):716–727
Magnusson J (2012) Social network analysis utilizing Big Data
Technology.
https://www.diva-portal.org/smash/get/diva2:509757/FULLTEXT01.pdf
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Cza-
jkowski G (2010) Pregel: a system for large-scale graph process-
ing. In: Proceedings of the ACM SIGMOD international confer-
ence on management of data, ACM, pp 135–146
Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers
A (2011) Big Data: the next frontier for innovation, competition,
and productivity
Mendoza M, Poblete B, Castillo C (2010) Twitter under crisis: can we
trust what we RT? In: Giles CL, Mitra P, Perisic I, Yen J, Zhang
H (eds) SOMA@KDD. ACM, New York, pp 71–79
Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D,
Freeman J et al (2016) MLlib: machine learning in Apache Spark. J
Mach Learn Res 17(34):1–7
Middleton SE, Middleton L, Modafferi S (2014) Real-time crisis map-
ping of natural disasters using social media. IEEE Intell Syst
29(2):9–17
Mikolov T, Deoras A, Povey D, Burget L, Cernock J (2011) Strategies
for training large scale neural network language models. In: IEEE
Workshop on automatic speech recognition and understanding
(ASRU), IEEE, pp 196–201
Neumeyer L, Robbins B, Nair A, Kesari A (2010) S4: distributed
stream computing platform. In: IEEE international conference
on data mining workshops (ICDMW), IEEE, pp 170–177
Newman R, Chang V, Walters RJ, Wills GB (2016) Web 2.0–the past
and the future. Int J Inf Manag 36(4):591–598
Nguyen DT, Hwang D, Jung JJ (2014) Time-frequency social data
analytics for understanding social big data. In: IDC, Studies in
Computational Intelligence, vol 570. Springer, pp 223–228
Oh C, Sasser S, Almahmoud S (2015) Social media analytics frame-
work: the case of Twitter and Super Bowl ads. J Inf Technol
Manag 26(1):1–18
Olshannikova E, Ometov A, Koucheryavy Y, Olsson T (2016) Visu-
alizing Big Data. In: Big Data technologies and applications,
Springer, pp 101–131
Orgaz GB, Jung JJ, Camacho D (2016) Social big data: recent achieve-
ments and new challenges. Inf Fus 28:45–59
Oussous A, Benjelloun F-Z, Lahcen AA, Belfkih S (2017) Big Data
technologies: a survey. J King Saud Univ Comput Inf Sci.
https://doi.org/10.1016/j.jksuci.2017.06.001
Owen S, Owen S (2012) Mahout in action. Manning Publications Co.,
Shelter Island
Peng S, Wang G, Xie D (2017) Social influence analysis in social
networking big data: opportunities and challenges. IEEE Netw
31(1):11–17
Radicati S, Hoang Q (2011) Email statistics report 2011–2015. The
Radicati Group, Inc. A Technology Market Research Firm
Rahmani A, Chen AC-L, Sarhan A, Jida J, Rifaie M, Alhajj R (2014)
Social media analysis and summarization for opinion mining: a
business case study. Social Netw Anal Min 4(1):171
Reuter C, Scholl S (2014) Technical limitations for designing applica-
tions for social media. In: Butz A, Koch M, Schlichter JH (eds)
Mensch & Computer workshop band. De Gruyter Oldenbourg,
Berlin, pp 131–139
Rowley J (2007) The wisdom hierarchy: representations of the DIKW
hierarchy. J Inf Sci 33(2):163–180
Sakaki T, Okazaki M, Matsuo Y (2013) Tweet analysis for real-time
event detection and earthquake reporting system development.
IEEE Trans Knowl Data Eng 25(4):919–931
Sakr S (2016) Large-scale graph processing systems. In: Big Data 2.0
Processing Systems: A Survey, Springer, Cham, pp 53–73
Santhanam T, Padmavathi M (2014) Comparison of K-means clus-
tering and statistical outliers in reducing medical datasets. In:
International conference on science engineering and management
research (ICSEMR), IEEE, pp 1–6
Sapountzi A, Psannis KE (2016) Social networking data analysis
tools & challenges. Future Gener Comput Syst.
https://doi.org/10.1016/j.future.2016.10.019
Schroeck M, Shockley R, Smart J, Romero-Morales D, Tufano P (2012)
Analytics: the real-world use of big data: How innovative enter-
prises extract value from uncertain data, Executive Report. In:
IBM Institute for Business Value and Said Business School at
the University of Oxford
Selvan LGS, Moh T-S (2015) A framework for fast-feedback opinion
mining on Twitter data streams. In: CTS, IEEE, pp 314–318
Siddiqa A, Hashem IAT, Yaqoob I, Marjani M, Shamshirband S, Gani
A, Nasaruddin F (2016) A survey of big data management: tax-
onomy and state-of-the-art. J Netw Comput Appl 71:151–166
Siddiqa A, Karim A, Gani A (2017) Big data storage technologies: a
survey. Front IT & EE 18:1040–1070
Skoric MM, Poor ND, Achananuparp P, Lim E-P, Jiang J (2012)
Tweets and votes: a study of the 2011 Singapore General Elec-
tion. In: IEEE Computer Society, HICSS, pp 2583–2591
Stenmark D (2002) Information vs. knowledge: the role of intranets in
knowledge management. In: Proceedings of HICSS. IEEE Press
Stieglitz S, Dang-Xuan L (2013) Social media and political communi-
cation: a social media analytics framework. Soc Netw Anal Min
3(4):1277–1291
Stieglitz S, Dang-Xuan L, Bruns A, Neuberger C (2014) Social media
analytics. Wirtschaftsinformatik 56(2):101–109
Stieglitz S, Mirbabaie M, Ross B, Neuberger C (2018) Social media
analytics—challenges in topic discovery, data collection, and
data preparation. Int J Inf Manag 39:156–168
Storey VC, Song I-Y (2017) Big data technologies and management:
what conceptual modeling can do. Data Knowl Eng 108:50–67
Strohbach M, Daubert J, Ravkin H, Lischka M (2016) Big data storage.
In: New horizons for a data-driven economy, Springer, Cham,
pp 119–141
Taylor RC (2010) An overview of the Hadoop/MapReduce/HBase
framework and its current applications in bioinformatics. BMC
Bioinf 11(12):S1
Uddin MF, Gupta N etal. (2014) Seven V’s of Big Data understanding
Big Data to extract value. In: American Society for Engineering
Education (ASEE Zone 1), Zone 1 Conference of the IEEE, pp
1–5
Vatrapu R, Mukkamala RR, Hussain A, Flesch B (2016) Social set
analysis: a set theoretical approach to big data analytics. IEEE
Access 4:2542–2571
Vickery G, Wunsch-Vincent S (2007) Participative web and user-cre-
ated content: Web 2.0 wikis and social networking. Organization
for Economic Cooperation and Development (OECD)
Wang WY, Pauleen DJ, Zhang T (2016) How social media applications
affect B2B communication and improve business performance in
SMEs. Ind Mark Manag 54:4–14
Wang H, Xu Z, Pedrycz W (2017) An overview on the roles of fuzzy
set techniques in big data processing: trends, challenges and
opportunities. Knowl-Based Syst 118:15–30
White T (2012) Hadoop: the definitive guide. O'Reilly Media, Newton
Win SSM, Aung TN (2017) Target oriented tweets monitoring system
during natural disasters. In: Uehara K, Nakamura M (eds) ICIS,
IEEE Computer Society, pp 143–148
Wu Y, Cao N, Gotz D, Tan Y-P, Keim DA (2016) A survey on
visual analytics of social media data. IEEE Trans Multimed
18:2135–2148
Wu D, Sakr S, Zhu L (2017) Big data storage and data models. In:
Handbook of big data technologies, Springer, Cham, pp 3–29
Xin R, Rosen J, Zaharia M, Franklin MJ, Shenker S, Stoica I (2012)
Shark: SQL and rich analytics at scale. CoRR. abs/1211.6176
Yaqoob I, Hashem IAT, Gani A, Mokhtar S, Ahmed E, Anuar NB,
Vasilakos AV (2016) Big data: from beginning to future. Int J
Inf Manag 36(6):1231–1247
Yaqub U, Chun SA, Atluri V, Vaidya J (2017) Sentiment based analysis
of tweets during the US Presidential Elections. In: Hinnant CC,
Ojo A (eds) DG.O, ACM, New York, pp 1–10
Zeng D, Chen H, Lusch R, Li S-H (2010) Social media analytics and
intelligence. IEEE Intell Syst 25(6):13–16
... Fan and Gordon (2014) proposed a three-phase framework: Capture (data collection), Understand (data analysis to extract insights), and Present (presentation of findings). Sebei et al. (2018) illustrated the process pipeline, emphasizing the inherent challenges. Notably, prior research identifies two main phases in social media analytics: data management and data analysis (Fan and Gordon 2014;Lee 2018;Sebei et al. 2018). ...
... Sebei et al. (2018) illustrated the process pipeline, emphasizing the inherent challenges. Notably, prior research identifies two main phases in social media analytics: data management and data analysis (Fan and Gordon 2014;Lee 2018;Sebei et al. 2018). ...
... Social media data is diverse, including text, images, emojis, audio, and videos (Sebei et al. 2018;Verma et al. 2016). Scholars and practitioners refer to this as "big data" due to its substantial volume (Abkenar et al. 2021;Rahman and Reza 2022). ...
Article
Full-text available
Fashion brands including luxury brands are embracing TikTok to access young consumers, but there is a notable absence of research on how luxury fashion brands can leverage TikTok. Video analytics is crucial for understanding marketing communications via TikTok, a video-based social media platform. This study aims to examine how luxury brands establish their presence and effectively attract and engage with young consumers on TikTok through social media video analytics. A multiple case study approach was employed on the selected four luxury fashion brands. Data were collected from the selected brands’ official accounts, endorsed users’ accounts, and related hashtag links on TikTok. A three-stage content analysis of social media video analytics was conducted. The common and customized strategies employed across the selected brands on TikTok were identified, respectively. The findings revealed that young consumers prefer high-quality videos regarding branding messages, branded challenges, and influencers-led branded content. A consumer-brand engagement framework was proposed based on the data analysis. This research contributes to understanding how TikTok benefits the fashion industry and offers theoretical and practical insights for fashion brands to better harness TikTok. This study represents a pioneering endeavor in exploring social media video analytics, contributing to the advancement of marketing analytics literature.
... Social Media Analytics and Telegram Social media analytics (SMA) faces numerous challenges (Sebei, Hadj Taieb, and Ben Aouicha 2018) that are yet to be solved. For instance, the complexity of networks, diversity of platforms, and dynamics of social media platforms are causing difficulties in the application of SMA (Stieglitz et al. 2018). ...
Article
This paper examines the influence of scientific appearance (SA) on post dissemination and analyses a dataset of important actors in Germany, specifically those involved in the dissemination of disinformation on the social media platform Telegram. SA is identified through textual elements such as predefined keywords or digital object identifiers (DOIs). Characteristics and behaviours of actors with and without SA are compared using metadata such as forward counts and original posts. The additional content analysis provides insights into SA's usage and impact. The findings indicate that SA may influence the dissemination of posts and demonstrate how different methods can be applied for studying social media platforms.
... We adopted a methodology based on a data-driven approach (Sebei et al., 2018) consisting of three main steps: data collection, data processing and data analysis (Fig. 1). The collection of Wikiloc data in Auvergne was done in a Python environment (script available at this link: https://github.com/achaiallah-hub/Wiki4CES). ...
... Streaming data is related to data that is produced continuously in real-time. The speed of marking the generated data is evaluated in terms of the scale of the batch, near real-time, and real-time to reach streaming [10]. It is often generated by devices, social media feeds, sensors, or other sources that produce a constant flow of data. ...
Article
Full-text available
Big Data refers to the rapidly growing volume, variety, value, veracity, and velocity of data being generated in the modern digital world. It looks at the different kinds and sources of Big Data, such as structured data, semi-structured data, and unstructured data, highlighting the growing significance of the sources and elements of unstructured data from the social media perspective, Time Series Data, Geospatial Data, and Streaming Data. The three main challenges of Big Data, including characteristic challenges, processing challenges, and management challenges, are highlighted in this paper. This paper presents an overview of Big Data's characteristics, types, challenges, and various social media platforms. In conclusion, organizations not at all longer neglect unstructured data today; relatively, they are inventing means of evaluating it to extract information.
... Especially during the COVID-19 pandemic, social media has increasingly assumed a key role in providing information, spreading news and advertising [2] and has become one of the most effective digital marketing tools, with more companies embracing the power of social media analysis [2,3]. However, on the other hand, it has been highlighted in recent research works that only a small amount of this large portion of data gives added value, making the retrieval of valuable knowledge from this data a challenging task [4,5]. ...
Article
Full-text available
Text categorization and sentiment analysis are two of the most typical natural language processing tasks with various emerging applications implemented and utilized in different domains, such as health care and policy making. At the same time, the tremendous growth in the popularity and usage of social media, such as Twitter, has resulted on an immense increase in user-generated data, as mainly represented by the corresponding texts in users’ posts. However, the analysis of these specific data and the extraction of actionable knowledge and added value out of them is a challenging task due to the domain diversity and the high multilingualism that characterizes these data. The latter highlights the emerging need for the implementation and utilization of domain-agnostic and multilingual solutions. To investigate a portion of these challenges this research work performs a comparative analysis of multilingual approaches for classifying both the sentiment and the text of an examined multilingual corpus. In this context, four multilingual BERT-based classifiers and a zero-shot classification approach are utilized and compared in terms of their accuracy and applicability in the classification of multilingual data. Their comparison has unveiled insightful outcomes and has a twofold interpretation. Multilingual BERT-based classifiers achieve high performances and transfer inference when trained and fine-tuned on multilingual data. While also the zero-shot approach presents a novel technique for creating multilingual solutions in a faster, more efficient, and scalable way. It can easily be fitted to new languages and new tasks while achieving relatively good results across many languages. However, when efficiency and scalability are less important than accuracy, it seems that this model, and zero-shot models in general, can not be compared to fine-tuned and trained multilingual BERT-based classifiers.
... Especially during the COVID-19 pandemic, social media has increasingly assumed a key role in providing information, spreading news and advertising [2] and has become one of the most effective digital marketing tools, with more companies embracing the power of social media analysis [2,3]. However, on the other hand, it has been highlighted in recent research works that only a small amount of this large portion of data gives added value, making the retrieval of valuable knowledge from this data a challenging task [4,5]. ...
Article
Full-text available
Text categorization and sentiment analysis are two of the most typical natural language processing tasks with various emerging applications implemented and utilized in different domains, such as health care and policy making. At the same time, the tremendous growth in the popularity and usage of social media, such as Twitter, has resulted on an immense increase in user-generated data, as mainly represented by the corresponding texts in users' posts. However, the analysis of these specific data and the extraction of actionable knowledge and added value out of them is a challenging task due to the domain diversity and the high multilingualism that characterizes these data. The latter highlights the emerging need for the implementation and utilization of domain-agnostic and multilingual solutions. To investigate a portion of these challenges this research work performs a comparative analysis of multilingual approaches for classifying both the sentiment and the text of an examined multilingual corpus. In this context, four multilingual BERT-based classifiers and a zero-shot classification approach are utilized and compared in terms of their accuracy and applicability in the classification of multilingual data. Their comparison has unveiled insightful outcomes and has a twofold interpretation. Multilingual BERT-based classifiers achieve high performances and transfer inference when trained and fine-tuned on multilingual data. While also the zero-shot approach presents a novel technique for creating multilingual solutions in a faster, more efficient, and scalable way. It can easily be fitted to new languages and new tasks while achieving relatively good results across many languages. However, when efficiency and scalability are less important than accuracy, it seems that this model, and zero-shot models in general, can not be compared to fine-tuned and trained multilingual BERT-based classifiers.
Thesis
ABSTRACT: This thesis investigates the informational flow within Telegram, proposing a theoretical framework for research developed with data collected from this hybrid digital platform, positioned at the intersection of messaging app and social media. The study focuses on the dissemination of political disinformation in Brazil. A scoping review underpins the research, pinpointing empirical studies that used data from Telegram groups or channels. Methods, procedures for source selection, and data collection tools were analyzed.
The relevance of specific functionalities and affordances was emphasized. Additionally, a case study on pro-Bolsonaro groups during the 2022 electoral campaign was conducted, shedding light on strategies, actors, and flows of (dis)information. The research deepens the understanding of Telegram as a complex informational space, exploring its potential influence on the spread of political misinformation. The blend of digital methods and complex network analysis enabled the identification of app affordances and the development of a taxonomy of user actions. These elements constitute the proposed framework, serving as a guide for future inquiries, unveiling Telegram's unique facets and its implications in the contemporary informational landscape. Keywords: Telegram; digital platforms; affordances; digital methods; electoral campaigns; disinformation.
Article
Due to the explosive rise of online social networks, social network analysis (SNA) has emerged as a significant academic field in recent years. Understanding and examining social relationships in networks through network analysis opens up numerous research avenues in sociology, literature, media, biology, computer science, sports, and more. Accordingly, certain studies review and discuss some research verticals of SNA, such as viral marketing, information diffusion, clustering, and link prediction, to provide background knowledge and understanding. These studies still lack coverage of the SNA process, tools, and practical aspects in multidisciplinary applications. Inspired by these facts, we have discussed the background, process, tools, and applications of SNA. First, we have presented a detailed description of the SNA process. Thereafter, we presented a comparative analysis of SNA tools and languages. Finally, we have discussed the various applications corresponding to SNA research verticals.
Article
Examining the particular value of each platform for big data would be difficult because of the variety of social media forms and sizes. Using social media to objectively and subjectively analyze large groups of individuals makes it the most effective tool for this task. There are numerous sources of big data within the organization. Social media can be identified by the interaction and communication it facilitates. Utilizing social media has become a daily occurrence in modern society. In addition, because so many internet users are also active on social media, this frequent use generates data demonstrating the importance of researching the relationship between big data and social media. We conducted a systematic literature review (SLR) to identify 42 articles published between 2018 and 2022 that examined the significance of big data in social media and upcoming issues in this field. We also discuss the potential benefits of utilizing big data in social media. Our analysis discovered open problems and future challenges, such as high‐quality data, information accessibility, speed, natural language processing (NLP), and enhancing prediction approaches. As shown by our investigation of evaluation metrics for big data in social media, 24% of the metrics are related to data‐trace, 12% to execution time, 21% to accuracy, 6% to cost, 10% to recall, 11% to precision, 11% to F1‐score, and 5% to run-time complexity.
Article
Big Data analysis refers to advanced and efficient data mining and machine learning techniques applied to large amounts of data. Research work and results in the area of Big Data analysis are continuously rising, and more and more new and efficient architectures, programming models, systems, and data mining algorithms are proposed. Taking into account the most popular programming models for Big Data analysis (MapReduce, Directed Acyclic Graph, Message Passing, Bulk Synchronous Parallel, Workflow and SQL-like), we analysed the features of the main systems implementing them. Such systems are compared using four classification criteria (i.e. level of abstraction, type of parallelism, infrastructure scale and classes of applications) to help developers and users identify and select the best solution according to their skills, hardware availability, productivity and application needs.
Article
Since an ever-increasing part of the population makes use of social media in their day-to-day lives, social media data is being analysed in many different disciplines. The social media analytics process involves four distinct steps: data discovery, collection, preparation, and analysis. While there is a great deal of literature on the challenges and difficulties involving specific data analysis methods, there is hardly any research on the stages of data discovery, collection, and preparation. To address this gap, we conducted an extended and structured literature analysis through which we identified challenges addressed and solutions proposed. The literature search revealed that the volume of data was most often cited as a challenge by researchers. In contrast, other categories have received less attention. Based on the results of the literature search, we discuss the most important challenges for researchers and present potential solutions. The findings are used to extend an existing framework on social media analytics. The article provides benefits for researchers and practitioners who wish to collect and analyse social media data.
Article
Developing Big Data applications has become increasingly important in the last few years. In fact, several organizations from different sectors depend increasingly on knowledge extracted from huge volumes of data. However, in a Big Data context, traditional data techniques and platforms are less efficient. They show slow responsiveness and a lack of scalability, performance and accuracy. To face the complex Big Data challenges, much work has been carried out. As a result, various types of distributions and technologies have been developed. This paper is a review that surveys recent technologies developed for Big Data. It aims to help readers select and adopt the right combination of Big Data technologies according to their technological needs and specific applications’ requirements. It provides not only a global view of the main Big Data technologies but also comparisons according to different system layers such as the Data Storage Layer, Data Processing Layer, Data Querying Layer, Data Access Layer and Management Layer. It categorizes and discusses the main technologies' features, advantages, limits and usages.
Conference Paper
In a relatively short period of time, social media has gained significant importance as a mass communication and public engagement tool for political and governance purposes. Rapid dissemination of information through social media platforms such as Twitter provides politicians and campaigners with the ability to broadcast their message to a wide audience instantly and directly while bypassing the traditional media channels. In this paper, we investigate the nature and characteristics of the political discourse that took place on Twitter during the American Presidential elections of November 2016. The goal of this study is to perform an exploratory sentiment-based analysis of Twitter data that was gathered both before and after Election Day. Our objective is to identify the nature and sentiment of discussions, along with understanding the behavior of users with respect to their Twitter profile and associated attributes of their tweets. We also aim to inspect popular Twitter discussion topics and their relation to important news and events occurring simultaneously.
Article
Our work examines the risks and benefits stemming from the evolution of Social Network Services (SNSs) in the healthcare domain. More specifically, we study the impact of specific health-oriented social networks such as PatientsLikeMe. Social networks have evolved into a ubiquitous part of daily life, and Web 2.0 paved the way for the internet to be used as a method of interactive communication and information immersion. Health SNSs have the strength to influence healthcare services delivery and information availability, supported by emerging technologies which track, gather and quantify real-time medical data from patients. SNSs support publicly provided information to patients, offering them the power not only to educate themselves but also to take part in the decision-making process regarding their health. On the other hand, healthcare stakeholders have gained access to new information which can help to cut costs, progress research, and improve the healthcare system. However, apart from the unambiguous benefits of SNSs, several risks are identified, such as patient confidentiality violation. By incorporating the volumes of data collected by websites like PatientsLikeMe and other Web 2.0 applications, the patient–industry partnership could ensure better products at lower costs. Web 3.0 is the next step toward a health care eco-system which will evolve out of micro-contributions, creating the most accurate representations of medicine for the stakeholders.
Article
Data pre-processing is an important and critical step in the data mining process, and it has a huge impact on the success of data mining for soil classification. Data pre-processing is the first step of the knowledge discovery in databases (KDD) process; it reduces the complexity of the data and offers better analysis and ANN training. Based on the data collected from the field as well as the soil testing laboratory, data analysis is performed more accurately and efficiently. Data pre-processing is a challenging and tedious task, as it involves extensive manual effort and time in developing the data operation scripts. There are a number of different tools and methods used for pre-processing, including: sampling, which selects a representative subset from a large population of data; transformation, which manipulates raw data to produce a single input; denoising, which removes noise from data; normalization, which organizes data for more efficient access; and feature extraction, which pulls out specified data that is significant in some particular context. Pre-processing techniques for soil data sets are also useful for classification in data mining.
Article
There is a great thrust in industry toward the development of more feasible and viable tools for storing data of fast-growing volume, velocity, and diversity, termed ‘big data’. The structural shift of the storage mechanism from traditional data management systems to NoSQL technology is driven by the need to fulfil big data storage requirements. However, the available big data storage technologies are inefficient at providing consistent, scalable, and available solutions for continuously growing heterogeneous data. Storage is the preliminary process of big data analytics for real-world applications such as scientific experiments, healthcare, social networks, and e-business. So far, Amazon, Google, and Apache are some of the industry standards in providing big data storage solutions, yet the literature does not report an in-depth survey of storage technologies available for big data that investigates the performance and magnitude gains of these technologies. The primary objective of this paper is to conduct a comprehensive investigation of state-of-the-art storage technologies available for big data. A well-defined taxonomy of big data storage technologies is presented to assist data analysts and researchers in understanding and selecting a storage mechanism that better fits their needs. To evaluate the performance of different storage architectures, we compare and analyze the existing approaches using Brewer’s CAP theorem. The significance and applications of storage technologies and their support for other categories are discussed. Several future research challenges are highlighted with the intention of expediting the deployment of a reliable and scalable storage system.
Article
Purpose: This paper aims to propose a knowledge management (KM) framework for leveraging big social media data to help interested organizations integrate Big Data technology, social media and KM systems to store, share and leverage their social media data. Specifically, this research focuses on extracting valuable knowledge on social media by contextually comparing social media knowledge among competitors. Design/methodology/approach: A case study was conducted to analyze nearly one million Twitter messages associated with five large companies in the retail industry (Costco, Walmart, Kmart, Kohl’s and The Home Depot) to extract and generate new knowledge and to derive business decisions from big social media data. Findings: This case study confirms that this proposed framework is sensible and useful in terms of integrating Big Data technology, social media and KM in a cohesive way to design a KM system and its process. Extracted knowledge is presented visually in a variety of ways to discover business intelligence. Originality/value: Practical guidance for integrating Big Data, social media and KM is scarce. This proposed framework is a pioneering effort in using Big Data technologies to extract valuable knowledge on social media and discover business intelligence by contextually comparing social media knowledge among competitors.