A Hybrid Approach of Machine Learning and Lexicons to Sentiment Analysis: Enhanced Insights from Twitter Data of Natural Disasters
Shalak Mendon 1,2 · Pankaj Dutta 2 · Abhishek Behl 2 · Stefan Lessmann 3
Accepted: 7 January 2021
© Springer Science+Business Media, LLC, part of Springer Nature 2021
Abstract
The success of sentiment analysis lies in identifying the most frequent and relevant opinions that users express about a particular topic. In this paper, we develop a framework to analyze users' sentiments on Twitter about natural disasters, combining data pre-processing techniques with a hybrid of machine learning, statistical modeling, and lexicon-based approaches. We choose TF-IDF and K-means for sentiment classification over affinitive and hierarchical clustering. Latent Dirichlet Allocation and a pipeline of Doc2Vec and K-means are used to capture themes, after which we perform multi-level polarity index classification and its time-series analysis. In our study, we draw insights from 243,746 tweets on Kerala's 2018 natural disasters in India. The key findings of the study are the classification of sentiments based on similarity and polarity indices and the identification of themes among the topics discussed on Twitter; we also observe distinct sets of emotions and influencers, among other patterns. Through this case example of the Kerala floods, we show how the government and other organizations could track positive/negative sentiment with respect to time and location, gain a better understanding of the topics of discussion trending among the public, and collaborate with key Twitter users/influencers to disseminate information and identify gaps in the design and execution of relief schemes. The uniqueness of this research is the streamlined and efficient combination of algorithms and techniques embedded in the framework used to achieve these outputs, which can be integrated into a platform with a GUI for further automation.
Keywords: Sentiment analysis · K-means clustering · Latent Dirichlet allocation · Machine learning · Twitter · Natural disasters
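As a concrete, purely illustrative companion to the abstract, the sketch below wires together a TF-IDF representation, K-means clustering of tweets by similarity, and an LDA topic model in scikit-learn. The toy tweets, the choice of two clusters and two topics, and the use of scikit-learn's LatentDirichletAllocation are assumptions made for demonstration; they do not reproduce the authors' implementation, pre-processing, or parameter choices.

```python
# Minimal sketch: TF-IDF + K-means for similarity clustering and LDA for themes.
# All data and parameters below are illustrative assumptions, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

# Toy tweets standing in for the pre-processed corpus of disaster tweets.
tweets = [
    "rescue teams doing great work in kochi keralafloods",
    "roads submerged near aluva please send boats keralafloods",
    "donated to the cm relief fund doforkerala",
    "power cuts and no drinking water in chengannur",
    "salute to the fishermen volunteering for rescue indiaforkerala",
]

# 1) TF-IDF vectors capture how characteristic each term is of a tweet.
tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X = tfidf.fit_transform(tweets)

# 2) K-means groups tweets that use similar vocabulary (k=2 is arbitrary here).
km = KMeans(n_clusters=2, n_init=10, random_state=42)
print("cluster assignments:", km.fit_predict(X))

# 3) LDA on raw term counts surfaces latent themes (e.g. rescue vs. relief needs).
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(counts)
terms = count_vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"theme {i}:", top_terms)
```

A Doc2Vec embedding (for example from gensim) could be substituted for the TF-IDF step before K-means, which is closer to the theme-capture pipeline the abstract describes.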
1 Introduction
Sentiment analysis using social media is an emerging and rapidly growing field for understanding people's opinions about day-to-day events (Zahra et al. 2020). Social media websites such as Twitter, Facebook, YouTube, and LinkedIn have garnered billions of users worldwide and have been growing at a rapid pace (Kapoor et al. 2018). Especially in emerging countries with high growth rates of internet penetration, more and more people have adopted social media to talk to one another, share their opinions, and listen to others' views. The immediate transfer of data has proven to be extremely useful during natural disasters (Liu and Xu 2018; Bhuvana and Aram 2019).
Twitter, one such social media website, lets a user write messages of up to 280 characters at a time. These short messages help convey information quickly among users (Tang et al. 2009; Vomfell et al. 2018). Unlike lengthy articles or blogs written by a single user, which take time to analyze, Twitter messages are direct and to the point and reveal sentiment quickly. Tweets can be analyzed based on hashtags, which are typically keywords used by people, allowing all sentiments on a subject to be collated in one place (Khan et al. 2014; Pandey et al. 2017).

In this paper, we concentrate on developing a framework for sentiment analysis (Öztürk and Ayvaz 2018) that can be used in multiple scenarios. Our study considers the Kerala floods, which occurred in 2018 in India (Indian Express 2018). People worldwide used several hashtags such as #KeralaFloods, #DoForKerala, #IndiaForKerala, and #KeralaDonationChallenge, among others. These keywords were generated at different points in time and helped capture people's sentiments at different times (Bandyopadhyay et al. 2018).
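Because the hashtags emerged at different points in time, the polarity of the conversation can be tracked as a time series. The fragment below is a hedged sketch of that idea: tweets are filtered by hashtag, scored with a lexicon, and aggregated by day. The records are invented, and VADER is used only as a stand-in for the lexicon component; the paper's own multi-level polarity indices are not reproduced here.

```python
# Illustrative sketch: hashtag filtering + lexicon polarity + daily time series.
# The tweet records are invented; VADER is an assumed stand-in for the lexicon.
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

TRACKED_HASHTAGS = ("#keralafloods", "#doforkerala", "#indiaforkerala")

records = [
    ("2018-08-15 09:00", "Heartbreaking visuals from Aluva #KeralaFloods"),
    ("2018-08-15 18:30", "Amazing rescue work by the fishermen #KeralaFloods"),
    ("2018-08-16 11:10", "Donated to the relief fund today #DoForKerala"),
    ("2018-08-17 08:45", "Still no clean water in our area #KeralaFloods"),
]

df = pd.DataFrame(records, columns=["created_at", "text"])
df["created_at"] = pd.to_datetime(df["created_at"])

# 1) Keep only tweets mentioning one of the tracked hashtags.
has_tag = df["text"].str.lower().apply(lambda t: any(h in t for h in TRACKED_HASHTAGS))
df = df[has_tag].copy()

# 2) Lexicon-based polarity score in [-1, 1] for each tweet.
sia = SentimentIntensityAnalyzer()
df["polarity"] = df["text"].apply(lambda t: sia.polarity_scores(t)["compound"])

# 3) Daily mean polarity: a simple series that agencies could monitor over a disaster.
daily_polarity = df.set_index("created_at")["polarity"].resample("D").mean()
print(daily_polarity)
```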
* Corresponding author: Pankaj Dutta, pdutta@iitb.ac.in

1 Wipro Limited, Electronic City, Bengaluru, Karnataka 560100, India
2 SJM School of Management, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
3 Chair of Information Systems, School of Business and Economics, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
Information Systems Frontiers (2021) 23:1145–1168
https://doi.org/10.1007/s10796-021-10107-x
Published online: 14 February 2021
The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.