A framework for social media data analytics using Elasticsearch and Kibana

Neel Shah · Darryl Willick · Vijay Mago

© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
Real-time online data processing is quickly becoming an essential tool in the analysis of social media for political trends, advertising, public health awareness programs and policy making. Traditionally, processes associated with offline analysis are productive and efficient only when the data collection is a one-time process. Currently, cutting edge research requires real-time data analysis that comes with a set of challenges, particularly the efficiency of continuous data fetching within the context of present NoSQL and relational databases. In this paper, we demonstrate a solution that effectively addresses the challenges of real-time analysis using a configurable Elasticsearch search engine. We use a distributed database architecture, pre-built indexing and a standardized Elasticsearch framework for large scale text mining. The results from the query engine are visualized in almost real-time.
Keywords Social media · Big data · Real-time analysis · Elasticsearch · Visualization
1 Introduction
The exponential growth of online data poses a significant
challenge in the process of fetching a representative data
set that can be translated into tangible results [1,2]. Pre-
processing in real-time adds another layer of complexity,
especially when the data is textual and unstructured [3] or crowd sourced [4]. Solutions for processing big data sets in cloud computing and storage are advancing rapidly, but when we consider big data on the scale of petabytes [5], cloud based analytics are limited by network inefficiencies in transporting the data and by the recurring costs of the computational resources required to perform analysis in real-time [6]. Access and privacy also pose a
challenge in cloud based storage as server administrators
maintain the rights to view both the data and its flow.
Security solutions such as encrypted search are not feasible for real-time analysis because of computational limitations [7]. Currently, the top three
tools used for analyzing large databases are Elasticsearch,
Hadoop and Spark [8]. Elasticsearch is a distributed search
and analytics engine which allows real-time data transformations, search queries, document stream processing and indexing at relatively high speed. Additionally, Elasticsearch can index numbers, geographical coordinates, dates and almost any datatype, and it offers client libraries in multiple programming languages (e.g., Python, Java, Ruby). The speed of the Elasticsearch engine stems from its ability to perform aggregation, searching and processing over the index of the data [9]. Hadoop is a distributed batch computing platform, based on the MapReduce algorithm, that includes data extraction and transformation capabilities. While the platform's NoSQL foundations make uploading unstructured data easy, its query processor, HBase, lacks the advanced analytical search capabilities of Elasticsearch. Elasticsearch, by contrast, is a text search and analytics tool with an open source license and a visualization plugin suited to real-time analysis. Finally, Elasticsearch hosts plugins for Hadoop and Spark that bridge the two technologies and allow a hybrid system to be implemented [10].
Vijay Mago (corresponding author)
vmago@lakeheadu.ca

Neel Shah
nshah5@lakeheadu.ca

Darryl Willick
dwillic1@lakeheadu.ca

Department of Computer Science, Lakehead University, Thunder Bay, ON, Canada

Wireless Networks, https://doi.org/10.1007/s11276-018-01896-2
Tools that support the management of large data sets
and real-time data fetching include relational (MySQL,
Oracle Database, SQLite), Graph (Neo4j, Oracle Spatial)
and NoSQL (MongoDB, IBM Domino, Apache CouchDB).
A limiting factor common to all of these database types is the lack of support for full-text search in real-time. While NoSQL databases are functional for full-text searching, they lack reliability when compared to relational database models [3]. Traditional databases require that the data be uploaded first, after which the administrator must actively decide which data should be indexed; this extra layer of processing makes them infeasible for real-time analysis. Elasticsearch addresses these limiting factors [3] by providing a highly efficient data fetching and real-time analysis system that:
- performs pre-indexing before storing the data, to avoid the need to fetch and query specific data in real-time;
- requires limited resources and computing power in relation to traditional solutions; and
- provides a system that is distributed and easy to scale.
The capacity of Elasticsearch to support high efficiency, real-time data analysis is enhanced through a standardized configuration process, shard size management and standardization of the data before upload into Elasticsearch. We demonstrate this through a discussion of the working architecture as well as a real-time visualization of social media data collected between December 2017 and May 2018, a repository of over 1 billion Twitter data points.
1.1 Key contributions
- Optimizing and standardizing Twitter data for Elasticsearch
- Creating a configuration file and choosing the optimal shard size
- Demonstrating the real-time visualization of a very large scale social media data set
2 Architecture for real-time analysis
and storage
2.1 Elasticsearch
Elasticsearch began in 2004 as an open source project called Compass, which was based on Apache Lucene [11]. Elasticsearch is a distributed and scalable full-text search engine written in Java that is stable and platform independent. These features, combined with requirement-specific flexibility and easy expansion options, are helpful for real-time big data analysis [12]. We will discuss some of the general functions of Elasticsearch to provide context for the Elasticsearch configuration, data standardization and shard management procedures resulting from this research.
2.2 Abstract view
Figure 1 illustrates the framework for real-time analysis of very large scale data based on Elasticsearch and Kibana [13]. In the first step, the Twitter API is used for scraping Twitter data (approximately 1400 tweets per minute), which is stored in a MongoDB database installed on a Network Attached Storage (NAS) device with a capacity of 16 TB. The Twitter data is transferred to preprocessing units which handle the data and transfer it to High Performance Computing (HPC) infrastructure in almost real-time. As traditional databases, including MongoDB, are not efficient enough to handle real-time queries, we transfer the processing and analysis of data to Elasticsearch, which is implemented on the HPC lab resources. Before uploading the data, we standardize the Twitter object for Elasticsearch and use multithreading to upload the data, for better real-time performance and to shorten the gap between receiving and processing data. When a user needs any data, a query is sent to Elasticsearch through the Kibana front-end. Elasticsearch processes the query and sends the result object (in JSON format) to Kibana, which displays it to the user.
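To make the ingestion path concrete, here is a minimal, hypothetical sketch of the MongoDB-to-Elasticsearch leg using the official Python client [14] with multithreaded bulk uploads; the host addresses, database names and index name are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: drain standardized tweets from MongoDB and bulk-index
# them into Elasticsearch on worker threads to shorten the ingest gap.
from concurrent.futures import ThreadPoolExecutor

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from pymongo import MongoClient

es = Elasticsearch(["http://x.x.x.x:9200"])      # HPC-hosted cluster (assumed)
mongo = MongoClient("mongodb://nas-host:27017")  # NAS-hosted MongoDB (assumed)
collection = mongo["twitter"]["raw_tweets"]

def batches(cursor, size=1000):
    """Group documents into fixed-size batches for bulk indexing."""
    batch = []
    for doc in cursor:
        doc.pop("_id", None)  # Mongo's ObjectId is not JSON-serializable
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def upload(batch):
    # Each action indexes one standardized tweet into the "tweets" index.
    bulk(es, [{"_index": "tweets", "_source": doc} for doc in batch])

# Several uploader threads narrow the gap between receiving and indexing.
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(upload, batches(collection.find()))
```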
Within the general functioning of the search engine,
Elasticsearch uses a running instance called a node which
can take on one or more roles including a master or a data
node (see Sect. 2.1, Fig. 2). Clusters within Elasticsearch require at least one master and one data node; however, a cluster can consist of a single node, since a node may take on multiple roles. The only data storage format compatible with Elasticsearch is JSON, so the unstructured Twitter data requires data mapping before it can produce functional analysis and visualizations. We observed that reliance on the JSON format makes the system more flexible than MySQL and other RDBMS, but less flexible than MongoDB. While a traditional RDBMS uses tables to store the data, MongoDB uses the BSON (JSON-like) format, and Elasticsearch uses an inverted index via the Apache Lucene architecture to store the data [11]. A typical index in
Elasticsearch is a collection of documents with different
properties that have been organized through user defined
mapping that outlines document types and fields for dif-
ferent data sources; similar to a table in an SQL database.
The index is then split into shards housed in multiple nodes
where a shard is part of an index distributed on different
nodes. Within the Elasticsearch framework, the inverted
index allows a more categorical storage of big data sets
within nodes and shards so that real-time search queries are
more efficient. Elasticsearch uses a RESTful API to communicate with users; see Table 1 for a basic architecture comparison. Additionally, there are client libraries such as Elasticsearch in Python [14] and Java [15] for better integration.
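As a minimal illustration of this analogy, the sketch below creates an index with a user-defined mapping through the Python client [14]; the index and field names are assumptions for illustration, and 6.x-era clusters would additionally expect a mapping type name inside the mappings body.

```python
# Index ~ database, mapping ~ table, indexed JSON document ~ tuple (Table 1).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://x.x.x.x:9200"])

es.indices.create(
    index="tweets",
    body={
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "text": {"type": "text"},        # analyzed into the inverted index
                "lang": {"type": "keyword"},     # exact values, good for aggregations
                "created_at": {"type": "date"},
                "coordinates": {"type": "geo_point"},
            }
        }
    },
)

# Each stored document is JSON, the only format Elasticsearch accepts.
es.index(index="tweets", body={"id": "1", "text": "pizza time", "lang": "en"})
```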
2.2.1 Backbone
While Elasticsearch is a powerful tool, a model is required
to optimize functionality for the purpose of real-time big
data analysis specific to social media. The purpose of this
research is to provide (1) a specific configuration file to
optimize the organization of the data set, (2) an optimized
shard size for maximum efficiency in storage and pro-
cessing, and (3) a standardized structure for data fields present within Twitter to eliminate over-processing of irrelevant information. When data is stored in Elasticsearch, it is placed in an index first, and the index data is then stored as an inverted index using an automatic tokenizer. When we search in Elasticsearch, we get a
‘snapshot’ of the data, which means that Elasticsearch does
not require the hosting of actual content but instead links to
documents stored within a node to provide a result through the inverted index. These results are not real data but a representation of the query's linkages to all associated documents stored in each node.

Fig. 1 Framework for real-time analysis using Elasticsearch

Fig. 2 Elasticsearch cluster architecture hosted on the HPC at Lakehead University

Table 1 Comparison between Elasticsearch and RDBMS basic architecture

Elasticsearch    RDBMS
-------------    --------
Index            Database
Mapping          Table
Document         Tuple

As a component of this
project, the following configuration file was developed; it can be replicated in Elasticsearch on any HPC by editing the config files according to the number of nodes and the capacity of the server. Table 2 describes the basic configuration file for Elasticsearch.
Here, the name of the cluster is dslab; a cluster name is necessary even if only a single node is present. As Elasticsearch is a distributed database, where one or more nodes act as masters and the others as data nodes, this parameter is used to interconnect all the nodes in the cluster. We can create numerous clusters on the same hardware using different instances of Elasticsearch and different configuration files.
Table 3 explains the configuration file properties for any Elasticsearch node. In a distributed Elasticsearch deployment, the same file has to be configured in each and every instance. When data is stored, we use an index to hold a specific type of data, similar to a database in MySQL. The performance of Elasticsearch depends on the mapping of the index and on how we size the shards of the data set. The formula for deciding the number of shards is given in Eq. 1.
Number of shards = (Size of index in GB) / 50    (1)
The choice of 50 GB as the shard size follows from the Elasticsearch architecture, which supports a 32 GB index size and 32 GB of cache memory; ideally a shard's memory footprint should therefore be less than 64 GB, and through experimentation we observed that the best results are achieved at a shard size of 50 GB.
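A small sketch of how Eq. 1 might be applied at index creation time follows; the client calls mirror the earlier examples, and the 5300 GB figure is only a worked example (5300 / 50 gives the 106 shards that appear in the Table 5 response).

```python
# Apply Eq. 1: target roughly 50 GB per shard and round up, so no shard
# exceeds the recommended size.
import math

from elasticsearch import Elasticsearch

def shards_for(index_size_gb: float, target_gb: float = 50.0) -> int:
    return max(1, math.ceil(index_size_gb / target_gb))

es = Elasticsearch(["http://x.x.x.x:9200"])
es.indices.create(
    index="tweets",
    body={
        "settings": {
            "number_of_shards": shards_for(5300),  # e.g., a 5300 GB index -> 106 shards
            "number_of_replicas": 1,
        }
    },
)
```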
2.3 Kibana: visualization
In addition to Elasticsearch being efficient for real-time analysis, extended plugins such as Kibana [13] and Logstash [16] make it convenient to build functional representations of big data in real-time. Kibana is part of the Elastic Stack and is freely available under an open source license. It provides multiple standard visualizations by default and, with its drag and drop feature, simplifies the process of developing visualizations for end users. As Kibana is backed by the Elasticsearch architecture, it functions quickly and is efficient enough for real-time analysis. Finally, it provides the opportunity for graphical interaction when building and handling queries, with an accessible visualization of the cluster health and the properties of the database.
3 Social media data analysis
3.1 Configuration of the Elasticsearch
Live social media streaming data is stored in elastic clusters. Each elastic cluster contains 6 nodes, with each node having 2 threads and 12 GB of memory. Within these 6 nodes, one node works as a master and the remaining 5 work as data nodes. The architecture of the elastic cluster is shown in Fig. 2.
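One hypothetical way to confirm this layout from the Python client is to query the cluster health and node roles; the host address is an assumption.

```python
# Check that the cluster of Fig. 2 is assembled as intended: six nodes in
# total, one master-eligible and five data nodes, with "green" health.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://x.x.x.x:9200"])

health = es.cluster.health()
print(health["cluster_name"], health["status"], health["number_of_nodes"])

# The cat nodes API reports each node's roles ("m" = master-eligible,
# "d" = data on 6.x clusters).
for node in es.cat.nodes(format="json"):
    print(node["name"], node["node.role"])
```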
3.2 Social media dataset
We used Elasticsearch to analyze 250+ million out of 1 billion tweets scraped between December 2017 and May 2018 using the Twitter API. Since the Twitter API response is in JSON format and contains unstructured and inconsistent data, the presence of all data fields within the tweet JSON object is not guaranteed. Standardization of the data and conversion into a structured format is therefore necessary for Elasticsearch mapping, so that each field of data is present when loaded into the index. To optimize Elasticsearch we changed the storage format of the tweet so that all the data is required to
Table 2 Master and data node configuration file

Master node config file:
  cluster.name: dslab
  node.name: m1
  node.master: true
  node.data: true
  path.data: /data/nshah5/dataset
  path.logs: /data/nshah5/log
  network.host: x.x.x.x
  network.bind_host: 0
  network.publish_host: x.x.x.x
  discovery.zen.ping.unicast.hosts: ["x.x.x.x"]
  bootstrap.system_call_filter: false

Data node config file:
  cluster.name: dslab
  node.name: d1
  node.master: false
  node.data: true
  path.data: /data/nshah5/dataset
  path.logs: /data/nshah5/log
  network.host: x.x.x.x
  network.bind_host: 0
  network.publish_host: x.x.x.x
  discovery.zen.ping.unicast.hosts: ["x.x.x.x"]
  bootstrap.system_call_filter: false
be at depth level one in the JSON format. Table 4 depicts a basic example of restructured data in Elasticsearch.
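A hedged sketch of this restructuring is shown below: nested fields are hoisted to depth level one by joining key names, so the Elasticsearch mapping sees every field at the top level. The join convention and key names are illustrative assumptions.

```python
# Flatten a nested tweet object to depth one, e.g.
# {"User": {"Id": 42}} -> {"User_Id": 42}.
def flatten(obj, parent_key="", sep="_"):
    flat = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))  # recurse into nesting
        else:
            flat[new_key] = value
    return flat

tweet = {"Id": 1, "Text": "pizza", "User": {"Id": 42, "Name": "neel"}}
print(flatten(tweet))
# {'Id': 1, 'Text': 'pizza', 'User_Id': 42, 'User_Name': 'neel'}
```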
As we mentioned previously, the data is stored as an
inverted index that is optimized for text searches and
therefore very efficient. For example, if we search for the keyword "pizza" across all tweets (250+ million) in Elasticsearch, it takes 4060 ms (4.06 s) to find a total of 192,118 tweets where the "pizza" keyword is present in the tweet text. Table 5 shows the response from Elasticsearch for the "pizza" text search query. Figure 3a shows a pie chart mapping the geographical distribution of "pizza" tweets by nation, where the United States alone is responsible for 47% of the total and the countries outside the top five account for another 30%, together 77% of all tweets. Additionally, the visualization shows that the time taken to perform the query is 13 ms (0.013 s). Figure 3b shows the five most used languages in tweet text related to "pizza": English is used in more than 77% of tweets, Spanish in 12%, Portuguese in third place with 6%, French at 3% and Japanese at 2%. In this instance Elasticsearch took 17 ms for query processing. Figure 3c shows the devices used to tweet: 38% of tweets came from the iPhone Twitter app, the Android Twitter app was used for 29%, Twitter web clients for only 11%, and Twitter Lite and TweetDeck combined for around 7%. Other sources account for the remaining 15% of tweets. This query took 11 ms to execute, which is quite reasonable given the structure and amount of data.
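Queries of this kind can be reproduced with a short script; the following is a hedged sketch in which the index name and the text, lang and source fields are assumptions about the standardized tweet structure.

```python
# A full-text match on "pizza" plus terms aggregations over language and
# tweet source, mirroring the breakdowns of Fig. 3b, c.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://x.x.x.x:9200"])

response = es.search(
    index="tweets",
    body={
        "query": {"match": {"text": "pizza"}},
        "size": 0,  # only counts are needed, not the documents themselves
        "aggs": {
            "languages": {"terms": {"field": "lang", "size": 5}},
            "sources": {"terms": {"field": "source", "size": 5}},
        },
    },
)

# "took" is the server-side time in ms; hits.total is an integer on 6.x.
print(response["took"], "ms,", response["hits"]["total"], "matching tweets")
for bucket in response["aggregations"]["languages"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```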
The above results demonstrate the efficiency of this data analysis system in that all three tasks (fetching the data, performing descriptive analysis and creating graphs) were accomplished in less than 15 s on a database of 250+ million tweets. Clearly, this framework has proven
Table 3 Elasticsearch node configuration file features

cluster.name: The name of the cluster that the present node will join.
node.name: The name of the current node.
node.master: A Boolean flag that decides whether the node is master-eligible. The master node manages the overall state of the cluster, including node monitoring, index creation and deletion, and shard-to-node assignments.
node.data: A Boolean flag that decides whether the node stores data. A data node stores the physical data shards and performs reads, writes, searches and aggregations. A node can take the master role, the data role, or both.
path.data: The location of the actual data on the present node.
path.logs: The location where the logs of the present node are stored. Logs are important for diagnosing problems and monitoring working status.
network.host: The address of the present node, unique to each node in the cluster.
network.publish_host: The public address through which other nodes communicate with the present node.
Table 4 Difference between normal and updated structure

Original tweet structure:
{
  "Tweet": {
    "User": {
      "Id": ...,
      "Name": ...
    }
  },
  ...
}

Updated structure:
{
  "Id": ...,
  "Name": ...,
  ...
}
Table 5 Search query result for the "pizza" keyword

Result of keyword "pizza" from all tweets in the database:
{
  "took": 4060,
  "timed_out": false,
  "_shards": {
    "total": 106,
    "successful": 106,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 192118,
    "max_score": 15.110959,
    "hits": [...]
  }
}
suitable for the analysis of large text data in real-time
without losing accuracy. It also shows that the restructuring
and standardization procedures used on the data assisted in
optimizing the accuracy of the results and efficiency of the
processes in a context with limited resources.
3.3 Visualization dashboard
At present, the monitoring framework described in this paper is used to display data coming from the Twitter stream. For example, in Fig. 4 we show a snapshot of the Kibana dashboard. The top-most plot is a pie chart of tweet sources, which shows the devices used to tweet, such as iPhone, web browser, etc. The second plot from the top is a pie chart of the languages used to tweet. In the middle, the first histogram shows the time and volume of the Twitter data flow, the second shows a word cloud, and the bottom left shows the ten most active users. Similar dynamic dashboards can be created in minutes without any programming knowledge or understanding of the back-end system.
3.4 Limitation
As Elasticsearch is designed for real-time analysis, there are databases that perform better in offline mass data analysis, such as NoSQL databases (e.g., MongoDB) that support MapReduce [3]. Elasticsearch does not support MapReduce, relying instead on the inverted index [17]. Additionally, Elasticsearch can be slow when new data is added to the index, and it currently lacks support for other popular data formats (e.g., XML, CSV); it supports only the JSON format, which can be challenging for users unfamiliar with JSON [18].
4 Related work
Marcos [6] suggests that cloud computing is elastic in nature, as users can adjust it to their data needs, from processing power to storage. While it seems ideal in theory, cloud computing comes with several challenges, including network inefficiency in data transport as well as issues related to data privacy and access control. Additionally, Hashem refers to "data stabbing", the problems associated with storing and analyzing the heterogeneous and complex structure of big datasets [19]. As a solution, other authors such as Oleksii [3] support and highlight the benefits of Elasticsearch as a tool for real-time analysis in modern data mining repositories. In this research we attempted to address and resolve problems associated with data preprocessing and efficiency, while also discussing the elastic cluster framework in more depth.
Fig. 3 Real-time analysis of Twitter data for the term "pizza"
Currently there are very few research studies on frameworks for big data analysis in real-time, although several discuss applications in manufacturing [20] and gene coding [21]. Some researchers have used an Elasticsearch cluster, via a Logstash plugin, together with MySQL databases for a heterogeneous accounting information system [22]. The data is monitored using a MySQL server before being inserted into Elasticsearch. The researchers observed that there might be an issue of data duplication and storage space, but the architecture ensures flexibility and modularity for monitoring the system. They chose Elasticsearch as a real-time text search engine, which allows them to search historical data. The Mayo Clinic healthcare system developed a hybrid big data system using Hadoop and Elasticsearch technology, since in healthcare real-time results are essential for effective decision making. Previously, they used a traditional RDBMS database to store and process data, but it lacked integration between different platforms and the ability to query and ingest healthcare data in real-time or near real-time. In the Mayo Clinic system, Hadoop is used as a distributed file system and, on top of it, Elasticsearch works as a real-time text search engine. When there is a need for raw data Hadoop is used, and for real-time analysis Elasticsearch is used. Their experimentation showed very promising results; for example, searching 25.2 million HL7 records took just 0.21 s [23].
The DesignSafe web portal by Natural Hazards Engineering Research (NHER) analyzes and shares experimental data in real-time with researchers across the world. Users of the system send large amounts of data, which are stored in a distributed NFS. During preprocessing of the data, which includes string analysis and basic cleaning, they index the data and make it compatible with Elasticsearch. This model allows users in different locations to query, in real-time, the same experimental data computed in different parts of the world. All of these environments need to be correctly configured according to the data and the requirements [24].
5 Conclusion
Elasticsearch provides a functional system to store, pre-index, search and query very large scale data in real-time. In particular, the capability of expanding the cluster size as per the user's requirements, without stopping the service, makes it suitable for this application. This research provides insights on how to standardize and configure the processes of Elasticsearch, resulting in increased analysis efficiency. To demonstrate the functionality and interactivity for users, the Kibana plugin was used as an interface. In conclusion, a proper configuration of Elasticsearch and Kibana makes real-time analysis of large scale data efficient, and can help policy makers see results instantaneously in an accessible format that supports decision making.
Fig. 4 Partial view of the Kibana dashboard for the Twitter analysis

Acknowledgements This research is funded by the NSERC Discovery Grant; computing resources are provided by the High Performance Computing (HPC) Lab and the Department of Computer Science at Lakehead University, Canada. The authors are grateful to Gaurav Sharma for initially setting up the data collection stream, Salimur Choudhury for providing insight on the data analysis, and Andrew Heppner for reviewing and editing drafts.
References
1. Cervellini, P., Menezes, A. G., & Mago, V. K. (2016). Finding
trendsetters on yelp dataset. In 2016 IEEE symposium series on
computational intelligence (SSCI) (pp. 1–7). IEEE.
2. Belyi, E., Giabbanelli, P. J., Patel, I., Balabhadrapathruni, N. H.,
Abdallah, A. B., Hameed, W., et al. (2016). Combining associ-
ation rule mining and network analysis for pharmacosurveillance.
The Journal of Supercomputing,72(5), 2014–2034.
3. Kononenko, O., Baysal, O., Holmes, R., & Godfrey, M. W.
(2014). Mining modern repositories with Elasticsearch. In Pro-
ceedings of the 11th working conference on mining software
repositories (pp. 328–331). ACM.
4. Liu, Q., Kumar, S., & Mago, V. (2017). Safernet: Safe trans-
portation routing in the era of internet of vehicles and mobile
crowd sensing. In 2017 14th IEEE annual consumer communi-
cations and networking conference (CCNC) (pp. 299–304). IEEE.
5. Kim, M. G., & Koh, J. H. (2016). Recent research trends for
geospatial information explored by twitter data. Spatial Infor-
mation Research,24(2), 65–73.
6. Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79, 3–15.
7. Bösch, C., Hartel, P., Jonker, W., & Peter, A. (2014). A survey of provably secure searchable encryption. ACM Computing Surveys, 47(2), 18:1–18:51. https://doi.org/10.1145/2636328.
8. Kumar, P., Kumar, P., Zaidi, N., & Rathore, V. S. (2018).
Analysis and comparative exploration of elastic search, Mongodb
and Hadoop big data processing. In Soft computing: Theories and
applications (pp. 605–615). New York: Springer.
9. Cea, D., Nin, J., Tous, R., Torres, J., & Ayguadé, E. (2014). Towards the cloudification of the social networks analytics. In Modeling decisions for artificial intelligence (pp. 192–203). New York: Springer.
10. Bai, J. (2013). Feasibility analysis of big log data real time search
based on hbase and elasticsearch. In 2013 ninth international
conference on natural computation (ICNC) (pp. 1166–1170).
IEEE.
11. Elasticsearch-elastic.co. Retrieved April 30, 2018, from https://
www.elastic.co/guide/en/elasticsearch/reference/6.2/index.html.
12. Gormley, C., & Tong, Z. (2015). Elasticsearch: The definitive
guide: A distributed real-time search and analytics engine.
Sebastopol: O’Reilly Media, Inc.
13. Your Window into the Elastic Stack. Retrieved April 30, 2018, from https://www.elastic.co/products/kibana.
14. Python Elasticsearch Client. Retrieved April 30, 2018, from
https://elasticsearch-py.readthedocs.io/en/master/.
15. Java Elasticsearch library-Elastic. Retrieved April 30, 2018, from
https://www.elastic.co/guide/en/Elasticsearch/client/java-api/6.2/
index.html.
16. Getting Started with Logstash. Retrieved April 30, 2018, from
https://www.elastic.co/guide/en/logstash/current/getting-started-
with-logstash.html.
17. Yang, F., Tschetter, E., Léauté, X., Ray, N., Merlino, G., & Ganguli, D. (2014). Druid: A real-time analytical data store. In Proceedings of the 2014 ACM SIGMOD international conference on management of data (pp. 157–168). ACM.
18. Burkitt, K. J., Dowling, E. G., & Branon, T. R. (2014). System
and method for real-time processing, storage, indexing, and
delivery of segmented video. US Patent 8,769,576.
19. Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A.,
& Khan, S. U. (2015). The rise of big data on cloud computing:
Review and open research issues. Information Systems,47,
98–115.
20. Yang, H., Park, M., Cho, M., Song, M., & Kim, S. (2014). A
system architecture for manufacturing process analysis based on
big data and process mining techniques. In 2014 IEEE interna-
tional conference on big data (pp. 1024–1029). IEEE.
21. Stelzer, G., Plaschkes, I., Oz-Levi, D., Alkelai, A., Olender, T.,
Zimmerman, S., et al. (2016). Varelect: The phenotype-based
variation prioritizer of the genecards suite. BMC Genomics,
17(2), 444.
22. Bagnasco, S., Berzano, D., Guarise, A., Lusso, S., Masera, M., &
Vallero, S. (2015). Monitoring of IAAS and scientific applica-
tions on the cloud using the elasticsearch ecosystem. In Journal
of physics: Conference series (Vol. 608, p. 012016). Bristol: IOP
Publishing.
23. Chen, D., Chen, Y., Brownlow, B. N., Kanjamala, P. P., Arredondo, C. A. G., Radspinner, B. L., et al. (2017). Real-time or near real-time persisting daily healthcare data into HDFS and Elasticsearch index inside a big data platform. IEEE Transactions on Industrial Informatics, 13(2), 595–606.
24. Coronel, J. B., & Mock, S. (2017). Designsafe: Using Elasticsearch to share and search data on a science web portal. In Proceedings of the practice and experience in advanced research computing 2017 on sustainability, success and impact (p. 25). ACM.
Neel Shah is a graduate student at Lakehead University, Canada. Currently, he is working on analyzing social media data to gain insight into Canadian healthy behaviours. He is an active open source coder and maintains two open-source Python libraries. His core areas of interest are deep learning and data science.
Darryl Willick received the B.Sc.
(1988) and M.Sc. (1990)
degrees in Computational Sci-
ence from the University of
Saskatchewan, Canada.
Throughout his career he has
worked in the areas of High
Performance Computing, Visu-
alization, System administra-
tion, and Cyber Security.
Currently he is a Technology
Security Specialist/HPCC Ana-
lyst at Lakehead University,
Canada.
Vijay Mago is an Associate
Professor in the Department of
Computer Science at Lakehead
University in Ontario, Canada
where he teaches and conducts
research in areas including big
data analytics, machine learn-
ing, natural language process-
ing, artificial intelligence,
medical decision making and
Bayesian intelligence. He
received his Ph.D. in Computer
Science from Panjab University,
India in 2010. In 2011 he joined
the Modelling of Complex
Social Systems program at the IRMACS Centre of Simon Fraser
University. He has served on the program committees of many
international conferences and workshops. Recently in 2017, he joined
Technical Investment Strategy Advisory Committee Meeting for
Compute Ontario. He has published extensively (more than 50 peer
reviewed articles) on new methodologies based on soft computing and
artificial intelligent techniques to tackle complex systemic problems
such as homelessness, obesity, and crime. He currently serves as an
associate editor for IEEE Access and BMC Medical Informatics and
Decision Making and as co-editor for the Journal of Intelligent
Systems.