Social Media-Based Surveillance Systems for Healthcare using Machine Learning

Authors: Dr. Chetanpal Singh, Dr. Rahul Thakkar, and Jatinder Warraich

Abstract

Real-time surveillance in health informatics has attracted considerable attention from researchers, and many public health informatics initiatives have grown out of it. Such surveillance uses information from social media to predict disease outbreaks and to monitor diseases. The recent availability of social media data, especially from Twitter, has given researchers a real-time syndromic surveillance capability that supports quick analysis and conclusions when investigating disease outbreaks. This paper reviews recent machine learning trends and text classification methods used by surveillance systems that draw on social media data in healthcare. It also discusses the limitations and challenges of such systems and the future directions that can be considered in this domain.
European Journal of Engineering and Technology Research
ISSN: 2736-576X
DOI: http://dx.doi.org/10.24018/ejeng.2022.7.6.2914
Vol 7 | Issue 6 | November 2022
Keywords: Disease Prediction, Health Prediction, Instagram, Machine Learning, Outbreak, Social Media, Surveillance Systems, Twitter.
I. INTRODUCTION
The health information available on the internet has been seen as an opportunity to enhance public health surveillance. Health surveillance systems have traditionally depended on established mandatory and voluntary reporting of infectious diseases by doctors and laboratories [1]. Social media data now gives direct access to information that can support surveillance epidemiology, which monitors public health threats such as new diseases and provides early warning of pandemics. However, whether data from social media and the internet actually helps analyze potential public health threats remains an open question. Public health surveillance has wide-reaching applications in this century, but using such surveillance systems for infectious disease epidemiology raises challenges: the specific resources needed, the technical requirements, and acceptance by public health practitioners and policymakers [2].
Many machine learning algorithms, such as deep neural networks, Naive Bayes, and Multinomial Naive Bayes, have been proposed and used for epidemic prediction and classification in surveillance systems in the health informatics domain [3].
Submitted on October 10, 2022.
Published on November 07, 2022.
C. Singh, VIT, Australia.
(e-mail: Chetanpal.singh@vit.edu.au)
A. Research Objective
This paper deals with the latest trends in social media-based surveillance in healthcare. It also gives an overview of the machine learning algorithms used for monitoring the data [3]. The research questions are listed below.
RQ1: Which types of machine learning have become popular among authors developing social media-based surveillance systems in the health sector?
RQ2: Which social media data sources are most commonly used for surveillance in the healthcare domain?
RQ3: How are social media-based surveillance systems applied in the field of health informatics?
RQ4: What challenges do syndromic surveillance systems face when data from social media is included?
B. Research Motivation
This paper makes the following contributions that were not present in previous papers. It gives the article selection queries used against different digital library databases to choose the relevant articles [4]. It gives an overview of the popular machine learning classification algorithms used in social media-based surveillance systems in healthcare. It also presents a statistical analysis of the social media platforms and health topics studied by the selected articles [5].
C. Research Gap
Social media data has a major role to play in healthcare decision-making, in addition to its many uses in other sectors. New approaches and technologies are needed to reinforce the capability of traditional syndromic surveillance systems, the early detection of disease, and an immediate public health response. Many review papers have examined the surveillance systems that use social media data, covering their data sources, technologies, algorithms, applications, and evaluations. According to a recent review of social media-based surveillance systems in health informatics, however, researchers have not been much impressed with the development so far [2]. This paper gives a complete analysis of recent machine learning technology and approaches in this field, along with its challenges and future directions.
R. Thakkar, VIT, Australia.
(e-mail: rahul.thakkar@vit.edu.au)
J. Warraich, Holmesglen, Australia.
(e-mail: Jatinder.warraich@holmesglen.edu.au).
II. LITERATURE REVIEW
A. Machine Learning Methods Utilized by Surveillance
Systems to Process Social Media Data
Machine learning has grown in popularity in recent years for detecting patterns in images and raw data. According to [6], progress in machine learning offers epidemiologists the ability to mine a broad set of digital data. [7] studied several supervised machine learning algorithms for detecting personal health experiences, together with a deep gramulator approach that improved decisions when applied to an independent test set. [8] provided a detailed analysis of how natural language processing and machine learning can be combined with social media platforms to analyze huge datasets for population-level mental health research. Among the many methodological variations of machine learning, some architectures remain consistently popular.
Unlike a supervised classification algorithm, an unsupervised algorithm does not require labeled datasets to predict its output. This is why unsupervised classification is a popular alternative for analyzing text; however, it struggles to reach the same accuracy as a supervised method. This was seen when [9] classified tweets with both supervised and unsupervised methods. [9] also discussed topic modeling, described there as one of the best supervised techniques, which gives control over topic contents in contrast to older classifiers, particularly for a naturally noisy media channel.
1) Multinomial naive bayes
Multinomial Naive Bayes is one of the most popular supervised approaches for classifying Twitter content. [10] built a real-time allergy surveillance system that classifies tweets as positive or negative. A tweet is positive when it indicates that the author or another person is experiencing allergy symptoms; it is negative when it mentions things such as news or advertisements for general allergy awareness. The author concluded that the Naive Bayes Multinomial (NBM) model gave the best text classification performance as measured by the F-measure. [11] likewise used machine learning methods to classify tweets as personal or news-related.
The author then classified the personal tweets into two categories, negative or neutral. NBM gave the best result and outperformed the other two techniques used: classifiers such as Naive Bayes and SVM did not produce results as satisfactory as the Naive Bayes Multinomial model [12].
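To make the classification step concrete, the following is a minimal sketch of a Multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is not the implementation used in [10]-[12]; the function names, the whitespace tokenizer, and the four toy tweets are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems clean tweets further)."""
    return text.lower().split()

def train_mnb(examples, alpha=1.0):
    """Fit class priors and smoothed word likelihoods from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    total_docs = sum(class_docs.values())
    model = {"log_prior": {}, "log_lik": {}, "vocab": vocab}
    for label in class_docs:
        model["log_prior"][label] = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_lik"][label] = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
    return model

def predict_mnb(model, text):
    """Return the label with the highest log posterior; unseen words are ignored."""
    scores = {}
    for label, prior in model["log_prior"].items():
        score = prior
        for tok in tokenize(text):
            if tok in model["vocab"]:
                score += model["log_lik"][label][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy tweets: "positive" = someone has symptoms, "negative" = news or adverts.
TRAIN = [
    ("my pollen allergy is killing me today", "positive"),
    ("sneezing all day my allergy is back", "positive"),
    ("news report on allergy season awareness", "negative"),
    ("buy allergy medicine now special offer", "negative"),
]
model = train_mnb(TRAIN)
print(predict_mnb(model, "my allergy is back"))
print(predict_mnb(model, "special offer on allergy medicine"))
```

The word "allergy" appears in both classes, so the decision rests on the surrounding words, which is the behaviour the positive/negative labelling scheme relies on.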
Fig. 1. Architecture of MNB [10], a) Naive bayes supervised approach; b)
Naïve bayes with F features.
TABLE I: NAÏVE BAYES APPROACH OUTCOME
Methods used | Application | Outcome | Reference
Multinomial Naive Bayes | Real-time allergy surveillance system for classification of tweets as either positive or negative | Naive Bayes Multinomial model with an F measure is the best solution for the text classification performance | [10]
Multinomial Naive Bayes, Naive Bayes and SVM | Classification of tweets based on personal, or news related | NBM has provided the best result | [11], [12]
2) Support vector machine
The performance of a classification algorithm depends heavily on its input parameters and on the application; for the classification tasks considered here, however, SVM was found to be best suited. Using an SVM classification model, [13] successfully classified microblog posts as sick or non-sick. The author also highlighted that the time SVM needed for the classification task was not affected as the number of microblog posts grew, whereas the time KNN needed to complete the task increased. According to [13], SVM was the best classification method when compared with the other machine learning techniques. SVM plays a crucial role in classifying social media data on a range of health issues. According to [14], the SVM classifier is best suited when accuracy in predicting the class of tweets is the priority. Similarly, a study by [15] found that the algorithm can reach 90% accuracy when separating social media tweets into epidemiological and non-epidemiological.
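As an illustration of the kind of classifier discussed above, here is a small sketch of a linear SVM trained by sub-gradient descent on the L2-regularised hinge loss over bag-of-words counts. It is an assumption-laden toy rather than the model of [13]-[15]: the bias term is omitted, the step size and regularisation constant are arbitrary, and the example posts are hypothetical.

```python
from collections import Counter

def features(text):
    """Bag-of-words term counts for one post."""
    return Counter(text.lower().split())

def score(weights, text):
    """Linear decision value w . x (no bias term, to keep the sketch short)."""
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in features(text).items())

def train_linear_svm(examples, epochs=20, eta=0.1, lam=0.01):
    """Sub-gradient descent on the hinge loss with an L2 penalty.
    examples: list of (text, y) with y = +1 (sick) or -1 (non-sick)."""
    weights = {}
    for _ in range(epochs):
        for text, y in examples:
            violated = y * score(weights, text) < 1.0   # inside the margin?
            for tok in weights:                          # L2 shrinkage step
                weights[tok] *= 1.0 - eta * lam
            if violated:                                 # hinge-loss step
                for tok, cnt in features(text).items():
                    weights[tok] = weights.get(tok, 0.0) + eta * y * cnt
    return weights

def predict(weights, text):
    """+1 for sick, -1 for non-sick."""
    return 1 if score(weights, text) >= 0.0 else -1

# Hypothetical microblog posts.
TRAIN = [
    ("i have a fever and a bad cough", 1),
    ("feeling sick with flu and headache", 1),
    ("great football match last night", -1),
    ("new phone launch was exciting", -1),
]
weights = train_linear_svm(TRAIN)
print(predict(weights, "fever and cough today"))
print(predict(weights, "football match was great"))
```

Because prediction is a single dot product over the words of a post, its cost does not depend on how many training posts were seen, whereas a nearest-neighbour method must compare against the stored posts.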
Fig. 2. Support Vector Machine [13].
TABLE II: SVM APPROACH OUTCOME
Methods used | Application | Outcome | Reference
SVM classification model | Classification of sick microblog and non-sick microblog posts | SVM is best suited | [13]
SVM classification model | Accurate prediction of the class of tweets | SVM classifier is best suited | [14]
SVM classification model | Segregation of tweets from social media as epidemiological and non-epidemiological | SVM can reach 90% accuracy | [15]
3) Deep neural network
Convolutional deep neural networks play a crucial role in classifying health-related text. [16] combined several types of DNN, namely the convolutional neural network (CNN) and bidirectional long short-term memory (BLSTM), with machine learning approaches to classify measles-related tweets, and the researchers reported that the CNN gave remarkable results.
Fig. 3. Deep Neural Network [16].
TABLE III: CNN BLSTM APPROACH OUTCOME
Methods used | Application | Outcome | Reference
Convolutional neural network (CNN), bidirectional long short term memory (BLSTM) | Classification of measles-related tweet classification tasks | CNN has provided remarkable result | [16]
4) Decision tree
The decision tree classifier has performed well in predicting positive and negative tweets about personal health experiences [17]. [18] also used a decision tree classifier to identify tweets about swine flu. [19] achieved only average results with a decision tree classifier when classifying personal health experience tweets.
5) Logistic regression
Logistic regression is another popular choice among classification algorithms for data classification tasks. In a study by [20], logistic regression showed a better F1 measure and recall than SVM when classifying tweets as relevant or irrelevant to asthma. The maximum entropy classifier is also widely used for text classification [21]. [17] also used a logistic regression classifier. Moreover, illness tweets were monitored with a maximum entropy classifier [22], and another tweet classification study also used maximum entropy [23].
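Since the comparisons in this section are stated in terms of precision, recall, and the F1 measure, a short sketch of how these are computed may help. With TP, FP, and FN the counts of true positives, false positives, and false negatives, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The label names and the two prediction lists below are hypothetical and do not reproduce any result from [20].

```python
def precision_recall_f1(y_true, y_pred, positive="relevant"):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

R, I = "relevant", "irrelevant"
y_true = [R, R, R, R, I, I, I, I]   # hypothetical gold labels
pred_a = [R, R, R, I, R, I, I, I]   # hypothetical classifier A
pred_b = [R, R, I, I, I, I, I, I]   # hypothetical classifier B

result_a = precision_recall_f1(y_true, pred_a)
result_b = precision_recall_f1(y_true, pred_b)
print("A:", result_a)
print("B:", result_b)
```

In this toy case classifier B has perfect precision but misses half of the relevant tweets, so A has the better recall and F1. This is why a study may prefer one classifier on F1 and recall even when another looks better on precision alone.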
Fig. 4. Decision Tree Algorithm [17].
TABLE IV: DECISION TREE CLASSIFIER APPROACH OUTCOME
Methods used | Application | Outcome | Reference
Decision tree classifier | Predicting positive as well as negative tweets surrounding the personal health experience | Decision Tree gives good performance | [17]
Decision tree classifier | To differentiate tweets surrounding the swine flu | Good performance by decision tree | [18]
Decision tree classifier | For classifying the personal health experience tweets | Decision Tree gives average result | [19]
Fig. 5. Logistic Regression [20].
TABLE V: LOGISTIC REGRESSION APPROACH OUTCOME
Methods used | Application | Outcome | Reference
Logistic regression | Text classification | - | [17]
Logistic regression | For classification of relevant and irrelevant tweets regarding asthma | Logistic regression showed better F1 measure and recall as compared to SVM | [20]
Maximum entropy classifier | Text classification | - | [21], [23]
Maximum entropy classifier | Monitoring illness tweets | - | [22]
6) Naive Bayes
[14] used SVM and Naive Bayes to classify datasets of tweets about mosquito-borne disease, and the selected tweets were further classified into three classes (symptoms, fear, and prevention) using the same classifiers. [18] classified tweets to separate swine flu-related text from the noise of irrelevant tweets, using various machine learning techniques such as decision tree and random forest.
Fig. 6. Naive Bayes [24].
TABLE VI: NAÏVE BAYES AND SVM APPROACH OUTCOME
Methods used | Application | Outcome | Reference
SVM and Naive Bayes | Classification of data sets into mosquito-borne disease | Further classification of tweets was possible | [14]
Naive Bayes, SVM | To differentiate swine flu-related text from the noise of all the tweets | Naive Bayes and SVM gave the best outcome, with a measure of 0.77, as compared to decision tree and random forest | [18]
Naive Bayes | For text classification | Naive Bayes gives average performance compared to other classifiers | [24]
Naive Bayes | Dengue-suspected tweet was marked as irrelevant or relevant | Naive Bayes classifier gave the best performance | [23]
The author took all the swine flu-related words into account and found that Naive Bayes and SVM gave the best outcome, with a measure of 0.77. Naive Bayes has become a popular classification algorithm, and authors have used it for text classification, where it shows average performance compared with various other classifiers [24]. The Naive Bayes classifier gave the best performance when dengue-suspected tweets were marked as relevant or irrelevant. For this task, bigrams, trigrams, emojis, and location information were also considered [23].
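The bigram and trigram features mentioned above can be sketched as follows. This is only an illustration of n-gram extraction under the assumption of a whitespace tokenizer; it does not cover the emoji or location features, and it is not the feature pipeline of [23].

```python
def ngrams(tokens, n):
    """All contiguous runs of n tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text, max_n=3):
    """Unigrams, bigrams and trigrams of a tweet as one flat feature list."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(tokens, n))
    return feats

tokens = "high fever and rash".split()
print(ngrams(tokens, 2))   # bigrams
print(ngrams(tokens, 3))   # trigrams
print(len(extract_features("High fever and rash")))
```

The resulting feature list can be counted and passed to any of the classifiers discussed in this section in place of single-word counts.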
7) Random forest
The random forest approach has been used for social media text classification alongside conventional machine learning approaches. [25] experimented with random forest and the Naive Bayes classifier and found that the former performed far better than the Naive Bayes method. Other machine learning approaches to text mining, such as clustering and k-means, were also tried [13], [24]. Tweets were grouped on the basis of similar words, and tweets can also be differentiated by a similarity measure.
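Grouping tweets by shared words can be illustrated with a simple similarity measure. The sketch below uses Jaccard similarity between word sets and a greedy single-pass grouping; both the measure and the 0.3 threshold are assumptions chosen for illustration, not the clustering procedure of [13] or [24].

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_tweets(tweets, threshold=0.3):
    """Greedy grouping: a tweet joins the first group whose first member is similar enough."""
    groups = []  # each group is a list of tweet indices
    for i, text in enumerate(tweets):
        tokens = text.lower().split()
        for group in groups:
            representative = tweets[group[0]].lower().split()
            if jaccard(tokens, representative) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found, start a new one
    return groups

# Hypothetical tweets.
TWEETS = [
    "dengue fever cases rising",
    "dengue fever cases reported",
    "typhoid outbreak in city",
    "typhoid outbreak spreads in city",
]
groups = group_tweets(TWEETS)
print(groups)
```

A k-means style method would instead represent each tweet as a vector and iterate on cluster centres; the greedy version is shown only because it fits in a few lines.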
Fig. 7. Random Forest [25].
TABLE VII: RANDOM FOREST, NAÏVE BAYES AND K MEANS APPROACH OUTCOME
Methods used | Application | Outcome | Reference
Random Forest Classifier, Naïve Bayes | Social media text classification | Random Forest gave better performance | [25]
Random Forest Classifier, clustering, k means | Text mining for differentiating the tweets | Random Forest gave better performance | [13], [24]
8) K nearest neighbor
[12] used KNN alongside Naive Bayes, SVM, and Naive Bayes Multinomial to identify and monitor messages that report and discuss various types of allergies. The author concluded that k-NN had better precision than the other approaches in identifying whether a tweet describes an actual allergy incident or is an awareness tweet.
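A k-nearest-neighbour classifier for this incident-versus-awareness task could look like the sketch below. It assumes cosine similarity over raw term counts and k = 3; the labelled tweets are hypothetical and the code is not the system described in [12].

```python
import math
from collections import Counter

def vectorize(text):
    """Term-count vector for one tweet."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def knn_predict(train, text, k=3):
    """Majority label among the k training tweets most similar to `text`."""
    query = vectorize(text)
    ranked = sorted(train, key=lambda ex: cosine(query, vectorize(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled tweets.
TRAIN = [
    ("my peanut allergy flared up again", "incident"),
    ("got hives from my shellfish allergy", "incident"),
    ("my allergy made me sneeze all morning", "incident"),
    ("allergy awareness week starts monday", "awareness"),
    ("tips to prevent allergy this spring", "awareness"),
    ("learn about food allergy awareness", "awareness"),
]
print(knn_predict(TRAIN, "my allergy flared up this morning"))
print(knn_predict(TRAIN, "allergy awareness tips for spring"))
```

Every prediction ranks all stored tweets, so the cost of classifying one tweet grows with the size of the training set.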
Fig. 8. K-nearest Neighbor [12].
TABLE VIII: KNN WITH NAÏVE BAYES, SVM AND NAÏVE BAYES APPROACH OUTCOME
Methods used | Application | Outcome | Reference
KNN with Naive Bayes, SVM, and Naive Bayes Multinomial (NBM) | To figure out and monitor messages reporting and discuss various types of allergies | K-NN has better precision for identifying tweets as actual incident of allergy or awareness tweets | [12]
B. Various Kinds of Social Media Data Sources for Data
Collection
The number of social media users is increasing over time, and users share many different kinds of information, so researchers need to track the significant information in order to monitor social media activity relevant to public health. Social media also covers many topics other than public health.
Social media platforms are among the best places to find a large amount of information about public health. However, some researchers still doubt whether social media data can play a crucial role in detecting an outbreak and whether social media content can be analyzed for healthcare data [26]. Posts on social media and online search behaviour can be a crucial source of information about health outbreaks.
1) Twitter
Twitter is one of the most popular microblogging services; its users post tweets that unregistered users can also read. As a leading microblogging service, it has more than 300 million monthly active users, which is why the platform is regarded as a trustworthy source for identifying incidents of disease in the population. [27] used a set of seven search terms to gather more than 50,000 tweets, which were then studied and classified to analyze cardiac arrest and resuscitation. For disease surveillance, factors such as location, volume, time, and public perception are taken into account [28].
Recent work by [29] collected information from the social media platform for health organizations and used it to obtain information during an epidemic, which proved very helpful to those organizations.
A natural language processing approach has been used to obtain Ebola-related tweets grouped into four topics: risk factors, prevention education, disease trends, and compassion [30]. [31] studied the potential of the social media platform as a new way of sharing information. Twitter posts are trusted by millions of users, and such postings are also considered a fast source for identifying the incidence of disease in a population; researchers therefore consider it important to find an efficient method for examining and processing health-related tweets.
Moreover, [24] used an unsupervised method for partition vector representation. [10] studied allergy activity in a collection of allergy-related tweets. Twitter has also helped detect respiratory, gastrointestinal, and other health-related problems. Twitter data has been used to study a variety of public health issues such as allergy, mosquito-borne disease, and dengue [5].
2) Instagram
Instagram, another popular social media platform, was founded in 2010. It is a photo and video sharing platform, and since its founding the number of registered users has risen to 800 million [32]. Because it is a photo and video sharing platform, its data can be a good source for disease surveillance, and it has given satisfactory results [32]. [33] studied Ebola-related posts on two social media platforms, Twitter and Instagram, and the outcome indicated that Instagram was the better of the two for communicating with and reaching people during a health crisis.
C. Application of Social Media-Based Surveillance System
This section describes recent applications of popular surveillance systems in the health informatics domain. These include disease prediction, tracking of misinformation, and global awareness.
Fig. 9. Applications of social media-based surveillance system [3]: global awareness of the event; syndromic surveillance-based disease prediction; event-based surveillance and disease prediction; magnitude estimation of disease over time.
1) Global awareness of the event
The surveillance system plays a crucial role in monitoring general public awareness and gives insight into public perception of a health event once the event has been detected. Social media carries user-generated sentiment about the outbreak situation that reflects people's knowledge, attitudes, and perceptions [34]. Users can share their sentiments, opinions, and responses during the outbreak.
2) Syndromic surveillance-based disease prediction
Syndromic surveillance is one of the best tools for predicting outbreaks for public health purposes using data gathered from different sources. The information it provides is used to minimize the population's exposure to the disease and to take proper preventive measures. In the past few years, information from social media has been widely used to study disease incidence and to identify outbreaks. Social media data helps public health officials detect an outbreak earlier than traditional means. Studies have shown that surveillance systems in the healthcare domain help predict diseases of public health concern, and that their data takes the form of self-reported symptoms. For early warning and outbreak detection, Twitter data has been used as a tool for predicting swine flu, tuberculosis, Ebola, and syphilis [35]-[38]. Another study examined the incidence of diseases such as dengue and typhoid fever [24].
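One simple way to turn daily counts of symptom-related tweets into an early warning is to compare each day with a recent baseline. The sketch below flags a day when its count exceeds the mean of the previous seven days by more than two standard deviations. The rule, the window, the multiplier, and the counts are all assumptions for illustration; the studies cited above do not necessarily use this method.

```python
from statistics import mean, stdev

def flag_outbreak_days(daily_counts, window=7, k=2.0):
    """Indices of days whose count exceeds baseline mean + k * baseline std."""
    alerts = []
    for day in range(window, len(daily_counts)):
        baseline = daily_counts[day - window:day]   # the preceding `window` days
        threshold = mean(baseline) + k * stdev(baseline)
        if daily_counts[day] > threshold:
            alerts.append(day)
    return alerts

# Hypothetical daily counts of tweets mentioning flu symptoms.
counts = [10, 12, 11, 9, 10, 13, 11, 12, 10, 40, 45]
alerts = flag_outbreak_days(counts)
print(alerts)
```

A rolling baseline adapts to slow changes in posting volume, but a sustained surge quickly raises the threshold, so practical systems usually combine such a rule with other evidence.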
3) Event-based surveillance and disease prediction
Event-based surveillance is the rapid, organized capture of data about events that pose a serious risk to public health. The data can come from diverse internet sources such as media reports, online discussion platforms, routine reporting systems, personal information, or even rumors. In the context of web forums, an event is defined as excessive news posting. The importance of an event is proportional to the total number of postings about the topic, so the effect of an event on topic diffusion can be determined from that total. A study by [39] analyzed the Zika virus epidemic using a Twitter corpus.
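The idea that an event's importance is proportional to the number of postings about its topic can be sketched with a simple counter. The topic list, the minimum-post threshold, and the example posts are hypothetical, and keyword matching stands in for whatever topic detection a real system would use.

```python
from collections import Counter

def topic_volumes(posts, topics):
    """Number of posts that mention each topic keyword at least once."""
    volumes = Counter()
    for text in posts:
        tokens = set(text.lower().split())
        for topic in topics:
            if topic in tokens:
                volumes[topic] += 1
    return volumes

def detect_events(volumes, min_posts=3):
    """Topics with at least `min_posts` postings, most discussed first."""
    return [(topic, n) for topic, n in volumes.most_common() if n >= min_posts]

# Hypothetical posts and topics of interest.
POSTS = [
    "zika cases confirmed in the region",
    "health ministry issues zika warning",
    "zika virus spreading fast",
    "is this measles or just a rash",
    "zika travel advisory updated",
]
volumes = topic_volumes(POSTS, ["zika", "measles", "ebola"])
events = detect_events(volumes)
print(events)
```

Ranking by post count gives the proportional notion of importance described above; the threshold plays the role of deciding when posting becomes "excessive".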
Social media-based public health intelligence monitoring, which provides the situational awareness of public health threats needed to support surveillance activities, has grown remarkably over the last 20 years [5]. [40] proposed a software system, DEFENDER, that detects potential health events by analyzing Twitter streams and then presents the resulting events to users through a front-end user interface.
4) Magnitude estimation of disease over time
A surveillance system makes it easy to determine the magnitude of a health problem. Estimates of the future course of a disease support planning the allocation of resources for treatment and prevention [2]. Moreover, surveillance system analysis can play a significant role in determining the disease level over a certain period, and assessments can be made on that basis.
III. METHODS AND MATERIALS
The research aims to examine social media-based surveillance systems that use machine learning techniques to forecast illness in real time or near real time. The article selection criteria included research articles published between 2010 and 2018.
To compile a thorough bibliography of publications on social media-based surveillance systems in healthcare, the following scientific databases were searched: IEEE Xplore, Science Direct, PubMed, and ACM Digital Library.
The query run against the IEEE Xplore database is described first. The following query was formed using the advanced search of the IEEE Xplore database: ((("Abstract": surveillance) OR "Document Title": leadership OR "Abstract": outbreak)). When filters were applied, 656 items (Journals & Magazines and conferences) were found.
A. Describing and Running the Query in the ACM Digital Library System
The ACM Digital Library was searched in a similar way with the query: (((outbreak OR surveillance) OR acmdlTitle(+surveillance) AND (health* OR illness). This search returned 265 articles. In addition, we conducted an advanced search of the ScienceDirect database for the following terms in the title, abstract, and keywords: (surveillance OR outbreak) AND (health* OR illness) AND social media. This search retrieved 75 articles.
Finally, PubMed, which accesses the MEDLINE database, was used to find papers with the query: (surveillance [Title/Abstract]) OR (((epidemic [Title/Abstract]) AND ((health[Title/Abstract]) OR ((disease[Title/Abstract])) AND ((social media[Title/Abstract])) AND ((health[Title/Abstract]) OR ((disease[Title/Abstract]) AND ((social media[Title/Abstract]) AND. A total of 1240 articles were gathered for further review, of which roughly 244 were acquired from this search, which is consistent with the database counts (656 + 265 + 75 + 244 = 1240).
In addition, Google Scholar results and statistics were used to reflect recent trends related to our research. The search terms were surveillance system, social media, machine learning, and health informatics. The resulting graph clearly shows a rise over time in the number of healthcare articles. The use of social media data with various machine learning algorithms is the central concern in these surveillance systems.
Each of the roughly 1240 papers found was vetted separately by each of this paper's authors on the basis of its title and abstract. We accepted a paper for further review if the abstract, the title, or both described social media or web-based monitoring; otherwise it was dismissed. We then looked for the papers that included machine learning algorithms in their methodology.
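The two screening steps described above can be expressed as a pair of filters. The sketch below is only an illustration of that logic: the screening in this study was done by the authors reading each paper, and the record fields, keyword lists, and sample records here are hypothetical.

```python
# Hypothetical keyword lists standing in for the reviewers' judgement.
SCREEN_TERMS = ("social media", "web-based monitoring")
METHOD_TERMS = ("svm", "naive bayes", "neural network", "decision tree")

def on_topic(record):
    """Step 1: title or abstract mentions social media or web-based monitoring."""
    header = (record["title"] + " " + record["abstract"]).lower()
    return any(term in header for term in SCREEN_TERMS)

def uses_ml(record):
    """Step 2: methodology mentions a machine learning algorithm."""
    methodology = record["methodology"].lower()
    return any(term in methodology for term in METHOD_TERMS)

# Hypothetical bibliographic records.
RECORDS = [
    {"title": "Twitter-based flu surveillance",
     "abstract": "We monitor social media posts for outbreak signals.",
     "methodology": "An SVM classifier labels each tweet."},
    {"title": "Hospital staffing survey",
     "abstract": "A questionnaire study of nurse workloads.",
     "methodology": "Descriptive statistics."},
    {"title": "Web-based monitoring of dengue",
     "abstract": "Search logs are analysed.",
     "methodology": "Manual content analysis."},
]
stage1 = [r for r in RECORDS if on_topic(r)]
stage2 = [r for r in stage1 if uses_ml(r)]
print(len(stage1), len(stage2))
print([r["title"] for r in stage2])
```

Keeping the two steps separate mirrors the order described in the text: topical relevance is decided first, and the machine learning criterion is applied only to the papers that pass it.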
IV. RESULTS AND DISCUSSION
Although surveillance systems based on social media data have improved the early identification of outbreaks, epidemics, and related health events, other researchers have questioned the effectiveness of these monitoring systems for the following reasons:
A. Privacy Issues
Privacy issues arise when data, including private datasets, is obtained from social media accounts for health purposes. Even though social media data is publicly available, individuals may not want their posts or data to be used for research [41]. Users' expectations regarding public data and privacy were considered a significant factor.
B. Verification of the Data Set
Data gathered from social media must be validated. Standardization, verification, and control issues may arise if unofficial data from social media is used [42]. [43] validated the veracity of a vast quantity of diverse social media data.
The dataset is an essential part of a prediction model and has a significant impact on its outcomes. The following datasets are involved:
i. Historical data
ii. Training data
iii. Testing data
All of the datasets mentioned above are used in prediction
models. A considerable quantity of training data is necessary
to train a forecasting model whose predictions can then be
tested.
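One common way to relate the three kinds of data listed above is to treat the historical data as a time-ordered series and cut it into an earlier training portion and a later testing portion. The sketch below assumes that arrangement; the record layout of `(date, count)` pairs, the 80/20 split, and the function name are illustrative choices, not something the reviewed papers prescribe.

```python
def chronological_split(records, train_fraction=0.8):
    """Split time-stamped records so that all testing data is later than all training data.

    `records` is an iterable of (iso_date_string, value) pairs; ISO dates sort
    correctly as plain strings, so no date parsing is needed in this sketch.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ordered = sorted(records, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical weekly counts of flu-related posts (values invented for illustration).
historical = [(f"2018-01-{day:02d}", 100 + day) for day in range(1, 11)]
training, testing = chronological_split(historical)
```

Splitting by time rather than at random keeps the test honest for forecasting, because the model is never evaluated on periods earlier than the ones it was trained on.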
C. Noise
Noise is one of the most significant issues encountered
during data collection. The information gathered through
social networking sites may include data unrelated to the
goal: posts that contain illness-related words but have no
bearing on anyone's health. For example, posts featuring the
keyword "Irish Flu" may trigger a slew of flu-related activity
[20]. There are also situations in which a user posts a status
and is mistakenly assumed to be infected. In this way, false
information might affect illness management by public health
departments.
Eliminating such noise requires additional training so that
only the relevant data is retained for further analysis.
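The gap between keyword matching and a relevant post can be made concrete with a small sketch. It is a hand-written rule filter standing in for the trained classifier that the text calls for; the keyword list, the noise phrases other than "Irish Flu", and the first-person heuristic are assumptions made purely for illustration.

```python
import re

ILLNESS_KEYWORDS = ("flu", "fever", "cough")
# Phrases in which an illness word does not describe illness.
# "irish flu" comes from the text; "bieber fever" is a hypothetical addition.
NOISE_PHRASES = ("irish flu", "bieber fever")
# Crude cue that the author is writing about themselves (an assumed heuristic).
FIRST_PERSON = re.compile(r"\b(i|i'm|i've|my|me)\b")


def keyword_match(post):
    """Naive collection rule: any illness keyword appears as a whole word."""
    words = re.findall(r"[a-z']+", post.lower())
    return any(keyword in words for keyword in ILLNESS_KEYWORDS)


def likely_personal_report(post):
    """Stricter rule: keyword present, no known noise phrase, and a first-person cue."""
    text = post.lower()
    if not keyword_match(text):
        return False
    if any(phrase in text for phrase in NOISE_PHRASES):
        return False
    return FIRST_PERSON.search(text) is not None


posts = [
    "I have the flu and a fever, staying in bed",
    "Woke up with the Irish flu after last night",
    "Flu season coverage is all over the news",
]
collected = [p for p in posts if keyword_match(p)]
relevant = [p for p in posts if likely_personal_report(p)]
```

All three posts pass the naive keyword rule, but only the first survives the stricter one. Fixed rules like these miss many cases, which is why a classifier trained on labeled posts is normally used for this step.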
D. Bias Based on Demographics
Although social media can help collect demographic
information such as age, gender, and race, determining these
attributes is complex, and tuning an algorithm so that public
healthcare efforts are directed accordingly makes the task
harder still.
The research also points to a demographic skew: social media
data excludes those who are not active on social media, such
as older adults and children, who are least involved in such
platforms [44].
One of the few studies to look at users' profiles [45]
analyzed Facebook comments by the sex of the commenter and
discovered that men wrote more posts per person than women.
The majority of social media users are under the age of
thirty. The finding that social media data is weighted
towards frequently active users and young people further
supports the existence of this bias.
E. Variability in Lexicon and Language
Although communication via social media aids in extracting
healthcare data, the language is difficult to evaluate
semantically. The informal and imprecise nature of social
media communication leads to incomplete results. This
constraint has been studied in [46].
F. Low Confidence
Low confidence is another issue that occurs when using
social media data. One study documents conspiracy talk about
the Zika virus on Reddit during a public health crisis [47].
Handling such content requires further training of the
classification algorithm.
According to a recent study [48], official websites are a
more reliable source of vaccination information than social
media. The quality of health-related data available on the
internet varies. Many social media analysis tools may
interpret a surge of hyped posts as a sign that something
significant is happening, when the surge might reflect panic
rather than an actual outbreak.
Users may also claim to have the flu when they have an
ordinary cold, and others may discuss an illness only because
of heightened media coverage.
V. CONCLUSION
According to the reviewed literature, Twitter was the most
frequently used platform, and SVM was the most often used
classification approach. Furthermore, when data were
categorized into two classes, SVM was the best classifier.
This paper reviews the most recent trends in public health
monitoring systems that use different algorithms. Social
media-based surveillance systems are found to outperform
traditional methods. The paper has also discussed how the
collected data can be used to further improve monitoring
systems in the field of public health.
A. Future Work
Future work could combine internet data with contextual
information such as weather and demographic data to improve
forecast outcomes.
Factors such as sentiment, comments, locations, and other
input characteristics of user postings could be combined with
the text itself to improve the overall analysis.
User postings could be sorted into multiple categories, with
varying weights given to each class, to increase predictive
accuracy, and text analysis could be extended to images.
There is also considerable room for development in topic
modeling to generate more precise findings.
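As one possible reading of "varying weights given to each class", the sketch below computes inverse-frequency class weights, a common heuristic for imbalanced label sets. The weighting formula, the category names, and the label counts are assumptions chosen for illustration; the paper does not specify a weighting scheme.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (number_of_classes * class_count).

    Rare classes receive larger weights, so a learner that multiplies each
    example's loss by its class weight pays more attention to them.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Hypothetical multi-category labeling of ten posts.
labels = ["unrelated"] * 6 + ["news"] * 3 + ["personal"] * 1
weights = inverse_frequency_weights(labels)
```

With these counts the single "personal" post receives the largest weight and the six "unrelated" posts the smallest, which is the intended effect when personal reports are the signal of interest.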
REFERENCES
[1] Du LJ, Tang L. Using a Machine Learning Approach to Monitor
COVID-19 Vaccine Adverse Events (VAE) from Twitter Data.
Vaccines, 2022; 10(103): 1-11.
[2] Aiello E, Renson A, Zivich PN. Social Media and Internet-Based
Disease Surveillance for Public Health. Annu. Rev. Public Health,
2020; 41: 101-118.
[3] Gupta, Katarya R. Social media based surveillance systems for
healthcare using machine learning: A systematic review. Journal of
Biomedical Informatics, 2020; 108: 103500.
[4] Hossein Abad ZS, Kline A, Sultana M, Noaeen M. Digital public health
surveillance: a systematic scoping review. NPJ Digital Medicine, 2021;
4(41): 1-13.
[5] Chiolero, Buckeridge D. Glossary for public health surveillance in the
age of data science. Journal of Epidemiology Community Health, 2020;
74(7): 612-616.
[6] Bates M. Tracking Disease: Digital Epidemiology Offers New Promise
in Predicting Outbreaks. IEEE Pulse, 2017; 8: 18-22.
[7] Calix R, Gupta R, Gupta M, Jiang K. Deep gramulator: Improving
precision in the classification of personal health-experience tweets with
deep learning. 2017 IEEE International Conference on Bioinformatics
and Biomedicine (BIBM); 2017.
[8] Mike, Daniel C. Social media, big data, and mental health: current
advances and ethical implications. Current Opinion in Psychology,
2016; 9: 77-82.
[9] Sousa L, de Mello R, Cedrim D, Garcia A, Missier P, Uchôa A, Oliveira
A, Romanovsky A. VazaDengue: An information system for
preventing and combating mosquito-borne diseases with social
networks. Information Systems, 2018; 75: 26-42.
[10] Ji X, Chun S, Geller J. Monitoring public health concerns using twitter
sentiment classifications. 2013 IEEE International Conference on
Healthcare Informatics, IEEE; 2013.
[11] Lee K, Agrawal A, Choudhary A. Mining social media streams to
improve public health allergy surveillance. Proceedings of the 2015
IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining, 2015.
[12] Nargund K, Natarajan S. Public health allergy surveillance using
micro-blogs. 2016 Int. Conf. Adv. Comput. Commun. Informatics,
ICACCI; 2016.
[13] Yang N, Cui X, Hu C, Zhu W, Yang C. Chinese social media analysis
for disease surveillance. 2014 International Conference on
Identification, Information and Knowledge in the Internet of Things,
IEEE; 2014.
[14] Jain V, Kumar S. Effective surveillance and predictive mapping of
mosquito-borne diseases using social media. J. Comput. Sci., 2018; 25:
406-415.
[15] Espina K, Regina M, Estuar J. Infodemiology for Syndromic
Surveillance of Dengue and Typhoid Fever in the Philippines. Proc.
Comput. Sci., 2017; 121: 554-561.
[16] Du J, Tang L, Xiang Y, Zhi D, Xu J, Song H, Tao C. Public perception
analysis of tweets during the 2015 measles outbreak: Comparative
study using convolutional neural network models. J. Med. Internet
Res., 2018; 20: 1-11.
[17] Jiang K, Gupta R, Gupta M, Calix R, Bernard G. Identifying Personal
Health Experience Tweets with Deep Neural Networks. 2017 39th
Annual International Conference of the IEEE Engineering in Medicine
and Biology Society (EMBC); 2017.
[18] Kumar V, Kumar S. An Effective Approach to Track Levels of
Influenza-A (H1N1) Pandemic in India Using Twitter. Procedia
Computer Science, 2015; 70: 801-807.
[19] Calix R, Gupta R, Gupta M, Jiang K. Deep gramulator: Improving
precision in the classification of personal health-experience tweets with
deep learning. Proc.-2017 IEEE Int. Conf. Bioinforma. Biomed. BIBM;
January 2017.
[20] Korkontzelos, Piliouras D, Dowsey A, Ananiadou S. Boosting drug
named entity recognition using an aggregate classifier. Artif. Intell.
Med., 2015; 65: 145-153.
[21] Zhang W, Ram S, Burkart M, Pengetnze Y. Extracting signals from
social media for chronic disease surveillance. Proceedings of the 6th
International Conference on Digital Health Conference; 2016.
[22] Mowery. Twitter Influenza Surveillance: Quantifying Seasonal
Misdiagnosis Patterns and their Impact on Surveillance Estimates.
Online J Public Heal. Inf., 2016; 8(3).
[23] Nsoesie E, Flor L, Hawkins J, Maharana A, Skotnes T, Marinho F,
Brownstein J. Social media as a sentinel for disease surveillance: what
does sociodemographic status have to do with it? PLoS Currents, 2016;
8.
[24] Dai X, Bikdash M, Meyer B. From social media to public health
surveillance: Word embedding based clustering method for twitter
classification. SoutheastCon 2017, IEEE; 2017.
[25] Sousa, de Mello R, Cedrim D, Garcia A, Missier P, Uchôa A, Oliveira
A, Romanovsky A. VazaDengue: An information system for
preventing and combating mosquito-borne diseases with social
networks. Inf. Syst., 2018; 75: 26-42.
[26] Chaudhary S, Naaz S. Use of big data in computational epidemiology
for public health surveillance. 2017 International Conference on
Computing and Communication Technologies for Smart Nation
(IC3TSN), IEEE; 2017.
[27] Bosley J, Zhao N, Hill S, Shofer F, Asch D, Becker L, Merchant R.
Decoding twitter: Surveillance and trends for cardiac arrest and
resuscitation communication. Resuscitation, 2013; 84: 206-212.
[28] Stefanidis, Vraga E, Lamprianidis G, Radzikowski J, Delamater P,
Jacobsen K, Pfoser D, et al. Zika in Twitter: temporal variations of
locations, actors, and concepts. JMIR public health and surveillance,
2017; 3(2): e6925.
[29] Rudra, Sharma A, Ganguly N, Imran M. Classifying information from
microblogs during epidemics. Proceedings of the 2017 international
conference on digital health; 2017.
[30] Edd, Rn S. What can we learn about the Ebola outbreak from tweets?
Am. J. Infect. Control., 2015; 43: 563-571.
[31] Kwak H, Lee C, Park H, Moon S. What is Twitter, a Social Network or
a News Media? Arch. Zootec., 2011; 11: 297-300.
[32] Systrom K. Strengthening our commitment to safety and kindness for
800 million. Instagram Press, 2017. Accessed 9 March 2022. [Internet].
Available: https://instagram.tumblr.com/post/165759350412/170926-news.
[33] Guidry J, Jin Y, Orr C, Messner M, Meganck S. Ebola on Instagram
and Twitter: How health organizations address the health crisis in their
social media engagement. Public Relat. Rev., 2017; 43: 477-486.
[34] Tang L, Bie B, Zhi D. Tweeting about measles during stages of an
outbreak: A semantic network approach to the framing of an emerging
infectious disease. American Journal of Infection Control, 2018;
46(12): 1375-1380. https://doi.org/10.1016/j.ajic.2018.05.019.
[35] Kostkova P, Szomszor M, St. Luis C. #swineflu: The Use of Twitter as
an Early Warning and Risk Communication. ACM Transactions on
Management Information Systems, 2014; 5(2): 1-25.
https://doi.org/10.1145/2597892X.
[36] Zhou, Ye J, Feng Y. Tuberculosis surveillance by analyzing google
trends. IEEE Trans. Biomed. Eng., 2011; 58: 2247-2254.
[37] Yom-Tov E. Ebola data from the Internet: An opportunity for
syndromic surveillance or a news event? Proceedings of the 5th
international conference on digital health; 2015.
[38] Young S, Mercer N, Weiss R, Torrone E, Aral S. Using social media
as a tool to predict syphilis. Prev. Med. (Baltim), 2018; 109: 58-61.
[39] Nolasco D, Oliveira J. Subevents Detection through Topic Modeling in
Social Media Posts. Future Generation Computer Systems, 2018; 93:
290-303.
[40] Thapen, Simmie D, Hankin C, Gillard J. DEFENDER: detecting and
forecasting epidemics using novel data-analytics for enhanced
response. PloS one, 2016; 11(5): e0155417.
[41] Mckee R. Ethical issues in using social media for health and health care
research. Health Policy (New. York), 2013; 110: 298-301.
[42] Blouin-Genest G, Miller A. The politics of participatory epidemiology:
Technologies, social media and influenza surveillance in the US. Heal.
Policy Technol., 2017; 6: 192-197.
[43] Bodnar T, Salathé M. Validating models for disease detection using
twitter. Proceedings of the 22nd International Conference on World
Wide Web; 2013.
[44] Charles-Smith L, Reynolds T, Cameron M, Conway M, Lau E, Olsen
J, Pavlin J, et al. Using social media for actionable disease surveillance
and outbreak management: A systematic literature review. PloS one,
2015; 10(10): e0139701.
[45] Strekalova Y. Emergent health risks and audience information
engagement on social media. Am. J. Infect. Control., 2016; 44: 363-365.
[46] Limsopatham, Collier N. Towards the semantic interpretation of
personal health messages from social media. Proceedings of the ACM
First International Workshop on Understanding the City with Urban
Informatics; 2015.
[47] Kou Y, Gui X, Chen Y, Pine K. Conspiracy Talk on Social Media:
Collective Sensemaking during a Public Health Crisis. Proc. ACM
Human-Computer Interact, 2017; 1: 1-21.
[48] Cataldi J, Dempsey A, O'Leary S. Measles, the media, and MMR:
Impact of the 2014-15 measles outbreak. Vaccine, 2016; 34: 6375-6380.