A Hybrid Approach of Machine Learning and Lexicons to Sentiment Analysis: Enhanced Insights from Twitter Data of Natural Disasters
Shalak Mendon 1,2 · Pankaj Dutta 2 · Abhishek Behl 2 · Stefan Lessmann 3
Accepted: 7 January 2021
© Springer Science+Business Media, LLC, part of Springer Nature 2021
Abstract
The success of sentiment analysis lies in identifying the most frequent and relevant opinions that users express about a particular topic. In this paper, we develop a framework to analyze users' sentiments on Twitter about natural disasters, combining data pre-processing techniques with a hybrid of machine learning, statistical modeling, and lexicon-based approaches. We choose TF-IDF and K-means for sentiment classification over affinitive and hierarchical clustering. Latent Dirichlet Allocation and a pipeline of Doc2Vec and K-means are used to capture themes, after which we perform multi-level polarity index classification and its time-series analysis. In our study, we draw insights from 243,746 tweets on Kerala's 2018 natural disasters in India. The key findings of the study are the classification of sentiments based on similarity and polarity indices and the identification of themes among the topics discussed on Twitter; we also observe distinct sets of emotions and influencers, among other patterns. Through this case example of the Kerala floods, we show how the government and other organizations could track positive/negative sentiment with respect to time and location, gain a better understanding of the topics of discussion trending among the public, and collaborate with key Twitter users/influencers to disseminate information and identify gaps in the design and execution of relief schemes. The uniqueness of this research is the streamlined and efficient combination of algorithms and techniques embedded in the framework used to achieve these outputs, which can be integrated into a platform with a GUI for further automation.
Keywords: Sentiment analysis · K-means clustering · Latent Dirichlet allocation · Machine learning · Twitter · Natural disasters
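As a concrete, purely illustrative companion to the abstract, the sketch below wires together a TF-IDF representation, K-means clustering of tweets by similarity, and an LDA topic model in scikit-learn. The toy tweets, the choice of two clusters and two topics, and the use of scikit-learn's LatentDirichletAllocation are assumptions made for demonstration; they do not reproduce the authors' implementation, pre-processing, or parameter choices.

```python
# Minimal sketch: TF-IDF + K-means for similarity clustering and LDA for themes.
# All data and parameters below are illustrative assumptions, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

# Toy tweets standing in for the pre-processed corpus of disaster tweets.
tweets = [
    "rescue teams doing great work in kochi keralafloods",
    "roads submerged near aluva please send boats keralafloods",
    "donated to the cm relief fund doforkerala",
    "power cuts and no drinking water in chengannur",
    "salute to the fishermen volunteering for rescue indiaforkerala",
]

# 1) TF-IDF vectors capture how characteristic each term is of a tweet.
tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X = tfidf.fit_transform(tweets)

# 2) K-means groups tweets that use similar vocabulary (k=2 is arbitrary here).
km = KMeans(n_clusters=2, n_init=10, random_state=42)
print("cluster assignments:", km.fit_predict(X))

# 3) LDA on raw term counts surfaces latent themes (e.g. rescue vs. relief needs).
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(counts)
terms = count_vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"theme {i}:", top_terms)
```

A Doc2Vec embedding (for example from gensim) could be substituted for the TF-IDF step before K-means, which is closer to the theme-capture pipeline the abstract describes.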
1 Introduction
Sentiment analysis using social media is an emerging and rapidly growing field for understanding people's opinions about day-to-day events (Zahra et al. 2020). Social media websites such as Twitter, Facebook, YouTube, and LinkedIn have garnered billions of users worldwide and have been growing at a rapid pace (Kapoor et al. 2018). Especially in emerging countries with high growth rates of internet penetration, more and more people have adopted social media to talk to one another, share their opinions, and listen to others' views. The immediate transfer of data has proven to be extremely useful during natural disasters (Liu and Xu 2018; Bhuvana and Aram 2019).
Twitter, one such social media website, lets a user write messages of up to 280 characters at a time. These short messages help convey information quickly among users (Tang et al. 2009; Vomfell et al. 2018). Unlike lengthy articles or blogs written by a single user, which take time to analyze, Twitter messages are direct and to the point and reveal sentiment quickly. Tweets can be analyzed based on hashtags, which are typically keywords used by people, allowing all sentiments on a subject to be collated in one place (Khan et al. 2014; Pandey et al. 2017).

In this paper, we concentrate on developing a framework for sentiment analysis (Öztürk and Ayvaz 2018) that can be used in multiple scenarios. Our study considers the Kerala floods, which occurred in 2018 in India (Indian Express 2018). People worldwide used several hashtags such as #KeralaFloods, #DoForKerala, #IndiaForKerala, and #KeralaDonationChallenge, among others. These keywords were generated at different points in time and helped capture people's sentiments at different times (Bandyopadhyay et al. 2018).
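Because the hashtags emerged at different points in time, the polarity of the conversation can be tracked as a time series. The fragment below is a hedged sketch of that idea: tweets are filtered by hashtag, scored with a lexicon, and aggregated by day. The records are invented, and VADER is used only as a stand-in for the lexicon component; the paper's own multi-level polarity indices are not reproduced here.

```python
# Illustrative sketch: hashtag filtering + lexicon polarity + daily time series.
# The tweet records are invented; VADER is an assumed stand-in for the lexicon.
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

TRACKED_HASHTAGS = ("#keralafloods", "#doforkerala", "#indiaforkerala")

records = [
    ("2018-08-15 09:00", "Heartbreaking visuals from Aluva #KeralaFloods"),
    ("2018-08-15 18:30", "Amazing rescue work by the fishermen #KeralaFloods"),
    ("2018-08-16 11:10", "Donated to the relief fund today #DoForKerala"),
    ("2018-08-17 08:45", "Still no clean water in our area #KeralaFloods"),
]

df = pd.DataFrame(records, columns=["created_at", "text"])
df["created_at"] = pd.to_datetime(df["created_at"])

# 1) Keep only tweets mentioning one of the tracked hashtags.
has_tag = df["text"].str.lower().apply(lambda t: any(h in t for h in TRACKED_HASHTAGS))
df = df[has_tag].copy()

# 2) Lexicon-based polarity score in [-1, 1] for each tweet.
sia = SentimentIntensityAnalyzer()
df["polarity"] = df["text"].apply(lambda t: sia.polarity_scores(t)["compound"])

# 3) Daily mean polarity: a simple series that agencies could monitor over a disaster.
daily_polarity = df.set_index("created_at")["polarity"].resample("D").mean()
print(daily_polarity)
```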
* Corresponding author: Pankaj Dutta, pdutta@iitb.ac.in

1 Wipro Limited, Electronic City, Bengaluru, Karnataka 560100, India
2 SJM School of Management, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
3 Chair of Information Systems, School of Business and Economics, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
Information Systems Frontiers (2021) 23:1145–1168
https://doi.org/10.1007/s10796-021-10107-x
Published online: 14 February 2021
The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.