A framework for social media data analytics using Elasticsearch and Kibana

Neel Shah · Darryl Willick · Vijay Mago

© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
Real-time online data processing is quickly becoming an essential tool in the analysis of social media for political trends, advertising, public health awareness programs and policy making. Traditionally, processes associated with offline analysis are productive and efficient only when the data collection is a one-time process. Currently, cutting edge research requires real-time data analysis that comes with a set of challenges, particularly the efficiency of continuous data fetching within the context of present NoSQL and relational databases. In this paper, we demonstrate a solution that effectively addresses the challenges of real-time analysis using a configurable Elasticsearch search engine. We use a distributed database architecture, pre-built indexing and a standardized Elasticsearch framework for large scale text mining. The results from the query engine are visualized in almost real-time.
Keywords Social media · Big data · Real-time analysis · Elasticsearch · Visualization
1 Introduction
The exponential growth of online data poses a significant
challenge in the process of fetching a representative data
set that can be translated into tangible results [1,2]. Pre-
processing in real-time adds another layer of complexity,
especially when the data is textual and unstructured [3] or crowd sourced [4]. Solutions for processing big data sets in cloud computing and storage are advancing rapidly, but when we consider big data on the scale of petabytes [5], cloud based analytics are limited by network inefficiencies in transporting the data and by the recurring costs of the computational resources required to perform analysis in real-time [6]. Access and privacy also pose a
challenge in cloud based storage as server administrators
maintain the rights to view both the data and its flow.
Security solutions such as encrypted search are not feasible for real-time analysis because of computational limitations [7]. Currently, the top three
tools used for analyzing large databases are Elasticsearch,
Hadoop and Spark [8]. Elasticsearch is a distributed search
and analytics engine which allows real-time data transformations, search queries, document stream processing and indexing at relatively high speed. Additionally, Elasticsearch can index numbers, geographical coordinates, dates and almost any datatype, and it offers client libraries in multiple programming languages (e.g., Python, Java, Ruby). The speed of the Elasticsearch engine stems from its ability to perform aggregation, searching and processing over the index of the data [9]. Hadoop is a distributed batch computing platform, based on the MapReduce algorithm, that includes data extraction and transformation capabilities. While the platform's NoSQL foundations make uploading unstructured data easy, its query processor, HBase, lacks the advanced analytical search capabilities of Elasticsearch. Elasticsearch, by contrast, is a text search and analytics tool with an open source license and a visualization plugin suited to real-time analysis. Finally, Elasticsearch hosts plugins for Hadoop and Spark that bridge the two technologies and allow a hybrid system to be implemented [10].
Vijay Mago (corresponding author)
vmago@lakeheadu.ca

Neel Shah
nshah5@lakeheadu.ca

Darryl Willick
dwillic1@lakeheadu.ca

Department of Computer Science, Lakehead University, Thunder Bay, ON, Canada

Wireless Networks, https://doi.org/10.1007/s11276-018-01896-2
Tools that support the management of large data sets
and real-time data fetching include relational (MySQL,
Oracle Database, SQLite), Graph (Neo4j, Oracle Spatial)
and NoSQL (MongoDB, IBM Domino, Apache CouchDB).
A limiting factor common to all of these database types is the lack of support for full-text search in real-time. While NoSQL databases are functional for full-text searching, they lack reliability when compared to relational database models [3]. Traditional databases require that the data be uploaded first, after which the administrator must actively decide which data should be indexed; this extra layer of processing makes them infeasible for real-time analysis. Elasticsearch addresses these limiting factors [3] by providing a highly efficient data fetching and real-time analysis system that:
- performs pre-indexing before storing the data, to avoid the need to fetch and query specific data in real-time;
- requires limited resources and computing power in relation to traditional solutions; and
- provides a system that is distributed and easy to scale.
The capacity of Elasticsearch to support high efficiency, real-time data analysis is enhanced through a standardized configuration process, shard size management and standardization of the data before upload into Elasticsearch. We demonstrate this through a discussion of the working architecture as well as a real-time visualization of social media data collected between December 2017 and May 2018, a repository of over 1 billion Twitter data points.
1.1 Key contributions
- Optimizing and standardizing Twitter data for Elasticsearch
- Creating a configuration file and choosing the optimal shard size
- Demonstrating the real-time visualization of a very large scale social media data set
2 Architecture for real-time analysis
and storage
2.1 Elasticsearch
Elasticsearch began in 2004 as an open source project called Compass, which was based on Apache Lucene [11]. Elasticsearch is a distributed and scalable full-text search engine written in Java that is stable and platform independent. These features, combined with requirement-specific flexibility and easy expansion options, are helpful for real-time big data analysis [12]. We will discuss some of the general functions of Elasticsearch to provide context for the Elasticsearch configuration, data standardization and shard management procedures resulting from this research.
2.2 Abstract view
Figure 1 illustrates the framework for real-time analysis of very large scale data based on Elasticsearch and Kibana [13]. In the first step, the Twitter API is used for scraping Twitter data (approximately 1400 tweets per minute), which is stored in a MongoDB database installed on a Network Attached Storage (NAS) device with a capacity of 16 TB. The Twitter data is transferred to preprocessing units which handle the data and transfer it to High Performance Computing (HPC) infrastructure in almost real-time. As traditional databases, including MongoDB, are not efficient enough to handle real-time queries, we transfer the processing and analysis of data to Elasticsearch, which is implemented on the HPC lab resources. Before uploading the data, we standardize the Twitter object for Elasticsearch and use multithreading to upload the data, for better real-time performance and to shorten the gap between receiving and processing data. When a user needs any data, a query is sent to Elasticsearch through the Kibana front-end. Elasticsearch processes the query and sends the result object (in JSON format) to Kibana, which displays it to the user.
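To make the ingestion path concrete, here is a minimal, hypothetical sketch of the MongoDB-to-Elasticsearch leg using the official Python client [14] with multithreaded bulk uploads; the host addresses, database names and index name are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: drain standardized tweets from MongoDB and bulk-index
# them into Elasticsearch on worker threads to shorten the ingest gap.
from concurrent.futures import ThreadPoolExecutor

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from pymongo import MongoClient

es = Elasticsearch(["http://x.x.x.x:9200"])      # HPC-hosted cluster (assumed)
mongo = MongoClient("mongodb://nas-host:27017")  # NAS-hosted MongoDB (assumed)
collection = mongo["twitter"]["raw_tweets"]

def batches(cursor, size=1000):
    """Group documents into fixed-size batches for bulk indexing."""
    batch = []
    for doc in cursor:
        doc.pop("_id", None)  # Mongo's ObjectId is not JSON-serializable
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def upload(batch):
    # Each action indexes one standardized tweet into the "tweets" index.
    bulk(es, [{"_index": "tweets", "_source": doc} for doc in batch])

# Several uploader threads narrow the gap between receiving and indexing.
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(upload, batches(collection.find()))
```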
Within the general functioning of the search engine,
Elasticsearch uses a running instance called a node which
can take on one or more roles including a master or a data
node (see Sect. 2.1, Fig. 2). Clusters within Elasticsearch require at least one master and one data node; however, a cluster can consist of a single node, since a node may take on multiple roles. The only data storage format compatible with Elasticsearch is JSON, so the unstructured Twitter data requires data mapping before it can produce functional analysis and visualizations. We observed that reliance on the JSON format makes the system more flexible than MySQL and other RDBMS, but less flexible than MongoDB. While a traditional RDBMS uses tables to store the data, MongoDB uses the BSON (JSON-like) format, and Elasticsearch uses an inverted index via the Apache Lucene architecture to store the data [11]. A typical index in
Elasticsearch is a collection of documents with different
properties that have been organized through user defined
mapping that outlines document types and fields for dif-
ferent data sources; similar to a table in an SQL database.
The index is then split into shards housed in multiple nodes
where a shard is part of an index distributed on different
nodes. Within the Elasticsearch framework, the inverted
index allows a more categorical storage of big data sets
within nodes and shards so that real-time search queries are
more efficient. Elasticsearch uses a RESTful API to communicate with users; see Table 1 for a basic architecture comparison. Additionally, there are client libraries such as Elasticsearch in Python [14] and Java [15] for better integration.
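As a minimal illustration of this analogy, the sketch below creates an index with a user-defined mapping through the Python client [14]; the index and field names are assumptions for illustration, and 6.x-era clusters would additionally expect a mapping type name inside the mappings body.

```python
# Index ~ database, mapping ~ table, indexed JSON document ~ tuple (Table 1).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://x.x.x.x:9200"])

es.indices.create(
    index="tweets",
    body={
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "text": {"type": "text"},        # analyzed into the inverted index
                "lang": {"type": "keyword"},     # exact values, good for aggregations
                "created_at": {"type": "date"},
                "coordinates": {"type": "geo_point"},
            }
        }
    },
)

# Each stored document is JSON, the only format Elasticsearch accepts.
es.index(index="tweets", body={"id": "1", "text": "pizza time", "lang": "en"})
```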
2.2.1 Backbone
While Elasticsearch is a powerful tool, a model is required
to optimize functionality for the purpose of real-time big
data analysis specific to social media. The purpose of this
research is to provide (1) a specific configuration file to
optimize the organization of the data set, (2) an optimized
shard size for maximum efficiency in storage and pro-
cessing, and (3) a standardized structure for data fields present within Twitter to eliminate over-processing of irrelevant information. When data is stored in Elasticsearch, it is placed in an index first, and the index data is then stored as an inverted index using an automatic tokenizer. When we search in Elasticsearch, we get a
‘snapshot’ of the data, which means that Elasticsearch does
not require the hosting of actual content but instead links to
documents stored within a node to provide a result through the inverted index. These results are not real data but a representation of the query's linkages to all associated documents stored in each node.

Fig. 1 Framework for real-time analysis using Elasticsearch

Fig. 2 Elasticsearch cluster architecture hosted on the HPC at Lakehead University

Table 1 Comparison between Elasticsearch and RDBMS basic architecture

Elasticsearch    RDBMS
-------------    --------
Index            Database
Mapping          Table
Document         Tuple

As a component of this
project, the following configuration file was developed; it can be replicated in Elasticsearch on any HPC by editing the config files according to the number of nodes and the capacity of the server. Table 2 describes the basic configuration file for Elasticsearch.
Here, the name of the cluster is dslab; a cluster name is necessary even if only a single node is present. As Elasticsearch is a distributed database, where one or more nodes act as masters and the others as data nodes, this parameter is used to interconnect all the nodes in the cluster. We can create numerous clusters on the same hardware using different instances of Elasticsearch and different configuration files.
Table 3 explains the configuration file properties for any Elasticsearch node. In a distributed Elasticsearch deployment, the same file has to be configured in each and every instance. When data is stored, we use an index to hold a specific type of data, similar to a database in MySQL. The performance of Elasticsearch depends on the mapping of the index and on how we size the shards of the data set. The formula for deciding the number of shards is given in Eq. 1.
Number of shards = (Size of index in GB) / 50    (1)
The choice of 50 GB as the shard size follows from the Elasticsearch architecture, which supports a 32 GB index size and 32 GB of cache memory; ideally a shard's memory footprint should therefore be less than 64 GB, and through experimentation we observed that the best results are achieved at a shard size of 50 GB.
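A small sketch of how Eq. 1 might be applied at index creation time follows; the client calls mirror the earlier examples, and the 5300 GB figure is only a worked example (5300 / 50 gives the 106 shards that appear in the Table 5 response).

```python
# Apply Eq. 1: target roughly 50 GB per shard and round up, so no shard
# exceeds the recommended size.
import math

from elasticsearch import Elasticsearch

def shards_for(index_size_gb: float, target_gb: float = 50.0) -> int:
    return max(1, math.ceil(index_size_gb / target_gb))

es = Elasticsearch(["http://x.x.x.x:9200"])
es.indices.create(
    index="tweets",
    body={
        "settings": {
            "number_of_shards": shards_for(5300),  # e.g., a 5300 GB index -> 106 shards
            "number_of_replicas": 1,
        }
    },
)
```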
2.3 Kibana: visualization
In addition to Elasticsearch being efficient for real-time analysis, extended plugins such as Kibana [13] and Logstash [16] make it convenient to build functional representations of big data in real-time. Kibana is part of the Elastic Stack and is freely available under an open source license. It provides multiple standard visualizations by default and, with its drag and drop feature, simplifies the process of developing visualizations for end users. As Kibana is backed by the Elasticsearch architecture, it functions quickly and is efficient enough for real-time analysis. Finally, it provides the opportunity for graphical interaction when building and handling queries, with an accessible visualization of the cluster health and the properties of the database.
3 Social media data analysis
3.1 Configuration of the Elasticsearch
Live social media streaming data is stored in elastic clusters. Each elastic cluster contains 6 nodes, with each node having 2 threads and 12 GB of memory. Within these 6 nodes, one node works as a master and the remaining 5 work as data nodes. The architecture of the elastic cluster is shown in Fig. 2.
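One hypothetical way to confirm this layout from the Python client is to query the cluster health and node roles; the host address is an assumption.

```python
# Check that the cluster of Fig. 2 is assembled as intended: six nodes in
# total, one master-eligible and five data nodes, with "green" health.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://x.x.x.x:9200"])

health = es.cluster.health()
print(health["cluster_name"], health["status"], health["number_of_nodes"])

# The cat nodes API reports each node's roles ("m" = master-eligible,
# "d" = data on 6.x clusters).
for node in es.cat.nodes(format="json"):
    print(node["name"], node["node.role"])
```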
3.2 Social media dataset
We used Elasticsearch to analyze 250+ million out of 1 billion tweets scraped between December 2017 and May 2018 using the Twitter API. Since the Twitter API response is in JSON format and contains unstructured and inconsistent data, the presence of all data fields within the tweet JSON object is not guaranteed. Standardization of the data and conversion into a structured format is therefore necessary for Elasticsearch mapping, so that each field of data is present when loaded into the index. To optimize Elasticsearch we changed the storage format of the tweet so that all the data is required to
Table 2 Master and data node configuration file

Master node config file:
  cluster.name: dslab
  node.name: m1
  node.master: true
  node.data: true
  path.data: /data/nshah5/dataset
  path.logs: /data/nshah5/log
  network.host: x.x.x.x
  network.bind_host: 0
  network.publish_host: x.x.x.x
  discovery.zen.ping.unicast.hosts: ["x.x.x.x"]
  bootstrap.system_call_filter: false

Data node config file:
  cluster.name: dslab
  node.name: d1
  node.master: false
  node.data: true
  path.data: /data/nshah5/dataset
  path.logs: /data/nshah5/log
  network.host: x.x.x.x
  network.bind_host: 0
  network.publish_host: x.x.x.x
  discovery.zen.ping.unicast.hosts: ["x.x.x.x"]
  bootstrap.system_call_filter: false
be at depth level one in the JSON format. Table 4 depicts a basic example of restructured data in Elasticsearch.
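A hedged sketch of this restructuring is shown below: nested fields are hoisted to depth level one by joining key names, so the Elasticsearch mapping sees every field at the top level. The join convention and key names are illustrative assumptions.

```python
# Flatten a nested tweet object to depth one, e.g.
# {"User": {"Id": 42}} -> {"User_Id": 42}.
def flatten(obj, parent_key="", sep="_"):
    flat = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))  # recurse into nesting
        else:
            flat[new_key] = value
    return flat

tweet = {"Id": 1, "Text": "pizza", "User": {"Id": 42, "Name": "neel"}}
print(flatten(tweet))
# {'Id': 1, 'Text': 'pizza', 'User_Id': 42, 'User_Name': 'neel'}
```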
As we mentioned previously, the data is stored as an
inverted index that is optimized for text searches and
therefore very efficient. For example, if we search for the keyword "pizza" across all tweets (250+ million) in Elasticsearch, it takes 4060 ms (4.06 s) to find a total of 192,118 tweets where the "pizza" keyword is present in the tweet text. Table 5 shows the response from Elasticsearch for the "pizza" text search query. Figure 3a shows a pie chart mapping the geographical distribution of "pizza" tweets by nation, where the United States alone is responsible for 47% of the total and the countries outside the top five account for another 30%, together 77% of all tweets. Additionally, the visualization shows that the time taken to perform the query is 13 ms (0.013 s). Figure 3b shows the five most used languages in tweet text related to "pizza": English is used in more than 77% of tweets, Spanish in 12%, Portuguese in third place with 6%, French at 3% and Japanese at 2%. In this instance Elasticsearch took 17 ms for query processing. Figure 3c shows the devices used to tweet: 38% of tweets came from the iPhone Twitter app, the Android Twitter app was used for 29%, Twitter web clients for only 11%, and Twitter Lite and TweetDeck combined for around 7%. Other sources account for the remaining 15% of tweets. This query took 11 ms to execute, which is quite reasonable given the structure and amount of data.
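Queries of this kind can be reproduced with a short script; the following is a hedged sketch in which the index name and the text, lang and source fields are assumptions about the standardized tweet structure.

```python
# A full-text match on "pizza" plus terms aggregations over language and
# tweet source, mirroring the breakdowns of Fig. 3b, c.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://x.x.x.x:9200"])

response = es.search(
    index="tweets",
    body={
        "query": {"match": {"text": "pizza"}},
        "size": 0,  # only counts are needed, not the documents themselves
        "aggs": {
            "languages": {"terms": {"field": "lang", "size": 5}},
            "sources": {"terms": {"field": "source", "size": 5}},
        },
    },
)

# "took" is the server-side time in ms; hits.total is an integer on 6.x.
print(response["took"], "ms,", response["hits"]["total"], "matching tweets")
for bucket in response["aggregations"]["languages"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```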
The above results demonstrate the efficiency of this data analysis system in that all three tasks (fetching the data, performing descriptive analysis and creating graphs) were accomplished in less than 15 s on a database of 250+ million tweets. Clearly, this framework has proven
Table 3 Elasticsearch node configuration file features

cluster.name: The name of the cluster that the present node will join.
node.name: The name of the current node.
node.master: A Boolean flag that decides whether the node is master-eligible. The master node manages the overall state of the cluster, including node monitoring, index creation and deletion, and shard-to-node assignments.
node.data: A Boolean flag that decides whether the node stores data. A data node stores the physical data shards and performs reads, writes, searches and aggregations. A node can take the master role, the data role, or both.
path.data: The location of the actual data on the present node.
path.logs: The location where the logs of the present node are stored. Logs are important for diagnosing problems and monitoring working status.
network.host: The address of the present node, unique to each node in the cluster.
network.publish_host: The public address through which other nodes communicate with the present node.
Table 4 Difference between normal and updated structure

Original tweet structure:
{
  "Tweet": {
    "User": {
      "Id": ...,
      "Name": ...
    }
  },
  ...
}

Updated structure:
{
  "Id": ...,
  "Name": ...,
  ...
}
Table 5 Search query result for the "pizza" keyword

Result of keyword "pizza" from all tweets in the database:
{
  "took": 4060,
  "timed_out": false,
  "_shards": {
    "total": 106,
    "successful": 106,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 192118,
    "max_score": 15.110959,
    "hits": [...]
  }
}
suitable for the analysis of large text data in real-time
without losing accuracy. It also shows that the restructuring
and standardization procedures used on the data assisted in
optimizing the accuracy of the results and efficiency of the
processes in a context with limited resources.
3.3 Visualization dashboard
At present, the monitoring framework described in this paper is used to display data coming from the Twitter stream. For example, in Fig. 4 we show a snapshot of the Kibana dashboard. The top-most plot is a pie chart of tweet sources, which shows the devices used to tweet, such as iPhone, web browser, etc. The second plot from the top is a pie chart of the languages used to tweet. In the middle, the first histogram shows the time and volume of the Twitter data flow, the second shows a word cloud, and the bottom left shows the ten most active users. Similar dynamic dashboards can be created in minutes without any programming knowledge or understanding of the back-end system.
3.4 Limitation
As Elasticsearch is designed for real-time analysis, there are databases that perform better in offline mass data analysis, such as NoSQL databases (e.g., MongoDB) that support MapReduce [3]. Elasticsearch does not support MapReduce, relying instead on the inverted index [17]. Additionally, Elasticsearch can be slow when new data is added to the index, and it currently lacks support for other popular data formats (e.g., XML, CSV); it supports only the JSON format, which can be challenging for users unfamiliar with JSON [18].
4 Related work
Marcos [6] suggests that cloud computing is elastic in nature, as users can adjust it to their data needs, from processing power to storage. While it seems ideal in theory, cloud computing comes with several challenges, including network inefficiency in data transport as well as issues related to data privacy and access control. Additionally, Hashem refers to "data stabbing", the problems associated with storing and analyzing the heterogeneous and complex structure of big datasets [19]. As a solution, other authors such as Oleksii [3] support and highlight the benefits of Elasticsearch as a tool for real-time analysis in modern data mining repositories. In this research we attempted to address and resolve problems associated with data preprocessing and efficiency, while also discussing the elastic cluster framework in more depth.
Fig. 3 Real-time analysis of Twitter data for the term "pizza"
Currently there are very few research studies on frameworks for big data analysis in real-time, although several discuss applications in manufacturing [20] and gene coding [21]. Some researchers have used an Elasticsearch cluster, via a Logstash plugin, together with MySQL databases for a heterogeneous accounting information system [22]. The data is monitored using a MySQL server before being inserted into Elasticsearch. The researchers observed that there might be an issue of data duplication and storage space, but the architecture ensures flexibility and modularity for monitoring the system. They chose Elasticsearch as a real-time text search engine, which allows them to search historical data. The Mayo Clinic healthcare system developed a hybrid big data system using Hadoop and Elasticsearch technology, since in healthcare real-time results are essential for effective decision making. Previously, they used a traditional RDBMS database to store and process data, but it lacked integration between different platforms and the ability to query and ingest healthcare data in real-time or near real-time. In the Mayo Clinic system, Hadoop is used as a distributed file system and, on top of it, Elasticsearch works as a real-time text search engine. When there is a need for raw data Hadoop is used, and for real-time analysis Elasticsearch is used. Their experimentation showed very promising results; for example, searching 25.2 million HL7 records took just 0.21 s [23].
The DesignSafe web portal by Natural Hazards Engineering Research (NHER) analyzes and shares experimental data in real-time with researchers across the world. Users of the system send large amounts of data, which are stored in a distributed NFS. During preprocessing of the data, which includes string analysis and basic cleaning, they index the data and make it compatible with Elasticsearch. This model allows users in different locations to query, in real-time, the same experimental data computed in different parts of the world. All of these environments need to be correctly configured according to the data and the requirements [24].
5 Conclusion
Elasticsearch provides a functional system to store, pre-index, search and query very large scale data in real-time. In particular, the capability of expanding the cluster size as per the user's requirements, without stopping the service, makes it suitable for this application. This research provides insights on how to standardize and configure the processes of Elasticsearch, resulting in increased analysis efficiency. To demonstrate the functionality and interactivity for users, the Kibana plugin was used as an interface. In conclusion, a proper configuration of Elasticsearch and Kibana makes real-time analysis of large scale data efficient, and can help policy makers see results instantaneously in an accessible format that supports decision making.
Fig. 4 Partial view of the Kibana dashboard for the Twitter analysis

Acknowledgements This research is funded by the NSERC Discovery Grant; computing resources are provided by the High Performance Computing (HPC) Lab and the Department of Computer Science at Lakehead University, Canada. The authors are grateful to Gaurav Sharma for initially setting up the data collection stream, Salimur Choudhury for providing insight on the data analysis, and Andrew Heppner for reviewing and editing drafts.
References
1. Cervellini, P., Menezes, A. G., & Mago, V. K. (2016). Finding
trendsetters on yelp dataset. In 2016 IEEE symposium series on
computational intelligence (SSCI) (pp. 1–7). IEEE.
2. Belyi, E., Giabbanelli, P. J., Patel, I., Balabhadrapathruni, N. H.,
Abdallah, A. B., Hameed, W., et al. (2016). Combining associ-
ation rule mining and network analysis for pharmacosurveillance.
The Journal of Supercomputing,72(5), 2014–2034.
3. Kononenko, O., Baysal, O., Holmes, R., & Godfrey, M. W.
(2014). Mining modern repositories with Elasticsearch. In Pro-
ceedings of the 11th working conference on mining software
repositories (pp. 328–331). ACM.
4. Liu, Q., Kumar, S., & Mago, V. (2017). Safernet: Safe trans-
portation routing in the era of internet of vehicles and mobile
crowd sensing. In 2017 14th IEEE annual consumer communi-
cations and networking conference (CCNC) (pp. 299–304). IEEE.
5. Kim, M. G., & Koh, J. H. (2016). Recent research trends for
geospatial information explored by twitter data. Spatial Infor-
mation Research,24(2), 65–73.
6. Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79, 3–15.
7. Bösch, C., Hartel, P., Jonker, W., & Peter, A. (2014). A survey of provably secure searchable encryption. ACM Computing Surveys, 47(2), 18:1–18:51. https://doi.org/10.1145/2636328.
8. Kumar, P., Kumar, P., Zaidi, N., & Rathore, V. S. (2018).
Analysis and comparative exploration of elastic search, Mongodb
and Hadoop big data processing. In Soft computing: Theories and
applications (pp. 605–615). New York: Springer.
9. Cea, D., Nin, J., Tous, R., Torres, J., & Ayguadé, E. (2014). Towards the cloudification of the social networks analytics. In Modeling decisions for artificial intelligence (pp. 192–203). New York: Springer.
10. Bai, J. (2013). Feasibility analysis of big log data real time search
based on hbase and elasticsearch. In 2013 ninth international
conference on natural computation (ICNC) (pp. 1166–1170).
IEEE.
11. Elasticsearch-elastic.co. Retrieved April 30, 2018, from https://
www.elastic.co/guide/en/elasticsearch/reference/6.2/index.html.
12. Gormley, C., & Tong, Z. (2015). Elasticsearch: The definitive
guide: A distributed real-time search and analytics engine.
Sebastopol: O’Reilly Media, Inc.
13. Your Window into the Elastic Stack. Retrieved April 30, 2018, from https://www.elastic.co/products/kibana.
14. Python Elasticsearch Client. Retrieved April 30, 2018, from
https://elasticsearch-py.readthedocs.io/en/master/.
15. Java Elasticsearch library-Elastic. Retrieved April 30, 2018, from
https://www.elastic.co/guide/en/Elasticsearch/client/java-api/6.2/
index.html.
16. Getting Started with Logstash. Retrieved April 30, 2018, from
https://www.elastic.co/guide/en/logstash/current/getting-started-
with-logstash.html.
17. Yang, F., Tschetter, E., Léauté, X., Ray, N., Merlino, G., & Ganguli, D. (2014). Druid: A real-time analytical data store. In Proceedings of the 2014 ACM SIGMOD international conference on management of data (pp. 157–168). ACM.
18. Burkitt, K. J., Dowling, E. G., & Branon, T. R. (2014). System
and method for real-time processing, storage, indexing, and
delivery of segmented video. US Patent 8,769,576.
19. Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A.,
& Khan, S. U. (2015). The rise of big data on cloud computing:
Review and open research issues. Information Systems,47,
98–115.
20. Yang, H., Park, M., Cho, M., Song, M., & Kim, S. (2014). A
system architecture for manufacturing process analysis based on
big data and process mining techniques. In 2014 IEEE interna-
tional conference on big data (pp. 1024–1029). IEEE.
21. Stelzer, G., Plaschkes, I., Oz-Levi, D., Alkelai, A., Olender, T.,
Zimmerman, S., et al. (2016). Varelect: The phenotype-based
variation prioritizer of the genecards suite. BMC Genomics,
17(2), 444.
22. Bagnasco, S., Berzano, D., Guarise, A., Lusso, S., Masera, M., &
Vallero, S. (2015). Monitoring of IAAS and scientific applica-
tions on the cloud using the elasticsearch ecosystem. In Journal
of physics: Conference series (Vol. 608, p. 012016). Bristol: IOP
Publishing.
23. Chen, D., Chen, Y., Brownlow, B. N., Kanjamala, P. P., Arredondo, C. A. G., Radspinner, B. L., et al. (2017). Real-time or near real-time persisting daily healthcare data into HDFS and Elasticsearch index inside a big data platform. IEEE Transactions on Industrial Informatics, 13(2), 595–606.
24. Coronel, J. B., & Mock, S. (2017). Designsafe: Using Elasticsearch to share and search data on a science web portal. In Proceedings of the practice and experience in advanced research computing 2017 on sustainability, success and impact (p. 25). ACM.
Neel Shah is a graduate student at Lakehead University, Canada. Currently, he is working on analyzing social media data to gain insight into Canadian healthy behaviours. He is an active open source coder and maintains two open-source Python libraries. His core areas of interest are deep learning and data science.
Darryl Willick received the B.Sc.
(1988) and M.Sc. (1990)
degrees in Computational Sci-
ence from the University of
Saskatchewan, Canada.
Throughout his career he has
worked in the areas of High
Performance Computing, Visu-
alization, System administra-
tion, and Cyber Security.
Currently he is a Technology
Security Specialist/HPCC Ana-
lyst at Lakehead University,
Canada.
Vijay Mago is an Associate
Professor in the Department of
Computer Science at Lakehead
University in Ontario, Canada
where he teaches and conducts
research in areas including big
data analytics, machine learn-
ing, natural language process-
ing, artificial intelligence,
medical decision making and
Bayesian intelligence. He
received his Ph.D. in Computer
Science from Panjab University,
India in 2010. In 2011 he joined
the Modelling of Complex
Social Systems program at the IRMACS Centre of Simon Fraser
University. He has served on the program committees of many
international conferences and workshops. Recently in 2017, he joined
Technical Investment Strategy Advisory Committee Meeting for
Compute Ontario. He has published extensively (more than 50 peer
reviewed articles) on new methodologies based on soft computing and
artificial intelligent techniques to tackle complex systemic problems
such as homelessness, obesity, and crime. He currently serves as an
associate editor for IEEE Access and BMC Medical Informatics and
Decision Making and as co-editor for the Journal of Intelligent
Systems.