Naïve-Bayes family for sentiment analysis during COVID-19 pandemic and classification tweets

August 2022
Indonesian Journal of Electrical Engineering and Computer Science

DOI:10.11591/ijeecs.v28.i1.pp375-383

Indonesian Journal of Electrical Engineering and Computer Science

Vol. 28, No. 1, October 2022, pp. 375~383

ISSN: 2502-4752, DOI: 10.11591/ijeecs.v28.i1.pp375-383

Journal homepage: http://ijeecs.iaescore.com

Naïve-Bayes family for sentiment analysis during COVID-19 pandemic and classification tweets

Murtadha B. Ressan, Rehab F. Hassan

Department of Computer Science, University of Technology, Baghdad, Iraq

Article Info

Article history:

Received Dec 4, 2021

Revised Jun 30, 2022

Accepted Jul 27, 2022

ABSTRACT

This paper proposes a system for analyzing the sentiments of Twitter users, with the aim of building an accurate model that detects the different emotions expressed in a tweet. The analysis proceeds through several stages: pre-processing, feature extraction, and training of more than one machine learning (ML) model. Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes were selected as supervised classifiers because this family of methods is suitable for problems of this type. They were applied to a dataset of 3,057 tweets whose sentiments range over fear, happiness, anger, and sadness, and to a second dataset of 10,000 tweets (5,000 positive and 5,000 negative). The three Naïve Bayes models were applied to each dataset separately to analyze the sentiment it contains and to classify the tweets into their categories. The Multinomial Naïve Bayes model outperformed the other models, achieving an accuracy of 91.6% on the first dataset and 87.6% on the second. The researchers aim to continue this research with larger datasets and other sentiment analysis methods, to predict users' thoughts about COVID-19 or any other problem and to obtain higher accuracy for the models used.

Keywords:

Bernoulli Naïve Bayes

COVID-19

Multinomial Naïve Bayes

Naïve-Bayes

Sentiment analysis

Twitter

This is an open access article under the CC BY-SA license.

Corresponding Author:

Murtadha B. Ressan

Department of Computer Science, University of Technology

Baghdad, Iraq

Email: cs.19.02@grad.uotechnology.edu.iq

1. INTRODUCTION

The World Health Organization (WHO) declared COVID-19 a public health emergency of international concern on January 30, 2020, and characterized it as a pandemic on March 11, 2020 [1]. It is considered one of the most widespread, influential, and dangerous epidemics in global health history, as it causes a severe disease that sometimes leads to death [2]. Nowadays, people can share valuable information through powerful social media such as Facebook, Twitter, and other social networking platforms. During the pandemic, people shared their experiences, thoughts, and opinions mainly on Twitter, a popular social networking platform with more than 500 million users worldwide. Twitter is a primary source of health-related information because of the diversity of information shared by individuals and official bodies [3], so this information can be used fruitfully to study people's behavior and analyze their interaction with any issue raised [2], [4]. Because of the pandemic, many businesses were disrupted and many workers lost their jobs, while some industrial sectors, such as pharmaceuticals and health protection equipment, recovered. This paper examines the impact of these repercussions by analyzing tweets to identify the feelings they express, and it also discusses opinion mining for randomly collected tweets, as detailed later. The purpose of this research is to identify the feelings of Twitter users during the pandemic and to classify the tweets, in order to obtain a reliable, high-accuracy approach that can be adopted as a predictive approach to help find solutions to such problems [5].

The first dataset used in this paper contains more than 3,000 tweets taken from kaggle.com, in which the reactions expressed are categorized as joy, fear, anger, and sadness. A second dataset of 10,000 tweets (5,000 positive and 5,000 negative) was also used to classify tweets. Both datasets consist of tweets and Twitter comments in English. Supervised learning techniques from the Naïve Bayes family are used for sentiment classification and analysis. The Naïve Bayes family (standard Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes) was selected to match the type of problem to be solved (sentiment analysis, classification, opinion mining) and the type of dataset, since Naïve Bayes achieved good results and a high accuracy rate in sentiment classification in previous studies similar to this one.

These steps are applied to each dataset separately, the classification accuracy is recorded for each, and the results tables are presented in the Result and Discussion section. Each dataset passes through several stages, starting with preprocessing, since most datasets contain a lot of noise and useless components that degrade the results of the analysis. Preprocessing begins with cleaning the data, followed by tokenization, conversion of words to lowercase, removal of stop words, part of speech tagging, and returning each word to its base form (lemmatization), so that the analysis is accurate. Features are then extracted using the term frequency-inverse document frequency (TF-IDF) method, and the data are divided into two parts (70% for training and 30% for testing). It is worth noting that two word lists were collected for the second dataset, the first containing 2,005 positive English words and the second containing 4,781 negative English words. Each word of a negative tweet is compared with the positive word list, and any match is deleted from the tweet; likewise, any negative word found in a positive tweet is deleted. This step increased the classification accuracy.
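The word-list filtering step described in the last paragraph can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `filter_opposite_polarity` and the tiny word sets in the usage note are hypothetical, and the label strings "positive" and "negative" are assumed.

```python
def filter_opposite_polarity(tokens, label, positive_words, negative_words):
    """Drop words whose polarity contradicts the tweet's label.

    Positive words are removed from negative tweets and negative words
    are removed from positive tweets, as described in the text.
    The label values "positive"/"negative" are an assumption.
    """
    opposite = negative_words if label == "positive" else positive_words
    return [t for t in tokens if t not in opposite]
```

For example, with the hypothetical lists `{"good"}` and `{"awful"}`, the negative tweet `["not", "good", "but", "awful"]` keeps `"awful"` and loses `"good"`.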

2. RELATED WORK

Nowadays, researchers and stakeholders use social media as a rich statistical resource for analyzing sentiment and anticipating related outcomes. Social media is a good platform for expressing feelings, opinions, and initiatives. Twitter is one of the most popular social media platforms in the world, and it gives people a way to express themselves openly. Several studies comparable to this one are reviewed below, in descending order of reported accuracy.

Adamu et al. [6] compared the performance of six machine learning (ML) algorithms: Multinomial Naïve Bayes (MNB), support vector machine (SVM), random forest (RF), logistic regression (LR), K-nearest neighbor (KNN), and decision tree (DT). Their experiments show that the SVM outperforms the remaining classifiers, with the highest accuracy of 88%.

Shofiya and Abidi [7] analyzed the feelings of Canadian Twitter users towards the social distancing officially mandated in response to COVID-19, using Twitter data covering approximately thirty days. The authors used the SentiStrength tool and an SVM classifier. They found that 40% of the tweets expressed neutral feelings about the mandated distancing, 35% expressed negative feelings, and 25% expressed positive feelings. The SVM algorithm achieved an accuracy of 87%.

Villavicencio et al. [8] studied people's feelings towards COVID-19 vaccines in the Philippines, classifying opinions as positive, neutral, or negative. According to their results, 83% of the tweets supported vaccination, 9% were neutral, and only 8% expressed negative feelings. The data were preprocessed using various natural language processing (NLP) techniques, and a classifier model was developed using the Naïve Bayes classification algorithm with an accuracy of 81.77%.

Sari and Ruldeviyani [9] analyzed sentiment about COVID-19 transmission among commuter line passengers. The study compared two methods, and Naïve Bayes outperformed the decision tree with an accuracy of 73.59%.

3. METHOD

Based on the analysis of the literature, it was determined that there are open problems that need to be resolved in sentiment classification. Adopting supervised learning significantly reduces computational complexity and provides accuracy, at the expense of requiring a larger volume of training data. This section discusses the proposed methodology for analyzing and categorizing sentiment in the datasets used. The experiments were carried out in the Spyder IDE on Microsoft Windows 10. Sentiment analysis was performed in five steps: i) preprocessing, ii) feature extraction, iii) tagging of tweets and dataset segmentation, iv) implementation of classification algorithms, and v) comparison of evaluation results. Figure 1 illustrates these steps. Before going into the details of these steps, it is necessary to describe the datasets used in sentiment analysis. The steps of the proposed approach were applied to two datasets, which are detailed in the following subsections.

Figure 1. The general structure of the proposed system

3.1. Dataset of tweeters during the COVID-19 pandemic

Table 1 summarizes the first dataset (dataset A), obtained from kaggle.com. It is a set of 3,057 tweets representing four sentiments that reflect the feelings of Twitter users during the COVID-19 pandemic: joy, fear, anger, and sadness. The tweets are later classified into four categories according to these sentiments.

Table 1. Sentiments embedded in tweets of dataset A

| Sentiment type | Count of tweets | Sample of tweet |
|---|---|---|
| Joy | 708 | "Let us set the pace for innovation and creativity, bring out the entrepreneur in you, and break barriers of productivity 19pic twitter com/repw43b7qf." |
| Sad | 787 | 1881, Sensex closes 1 203 points lower nifty gives up 8 300 amid coronavirus crisis it financial stocks worst hit … shared via ndtv news app (android \| iPhone) |
| Fear | 801 | 1763, stone pelters need to be treated as terrorists they need to be detained under terrorist, and disruptive activities (prevention) act 1987 (tada) stone pelters are more danger than corona we need to vanish both stone corona & pelters (terrorist) from India |
| Anger | 761 | 3506,“if regular unemployment hits 20% black unemployment will likely be around 50% that will be a full collapse for black America” - “mnuchin warns senators that we could see a 20% unemployment rate due to coronavirus |

3.2. Dataset of tweets included in the Python libraries

The second dataset (dataset B) consists of 10,000 randomly selected tweets. It is one of the corpora included in the NLTK library, under the name twitter_samples. It is divided into 5,000 positive tweets and 5,000 negative tweets, as shown in Table 2.

Table 2. Sentiments embedded in tweets of dataset B

| Sentiment type | Count of tweets | Sample of tweet |
|---|---|---|
| Positive | 5,000 | "This is a movie that refreshes the mind and spirit along with the body, so original are its content, look, and style." |
| Negative | 5,000 | "Stupid, naive, and undeserving of what he got" |

3.3. Preprocessing

Preprocessing is one of the most effective ways to ensure that the data are sound before the analysis algorithms are applied. It includes data cleaning, splitting tweets and sentences into tokens, stop-word removal, and lemmatization.

3.3.1. Data cleaning

Cleaning the data is an early preprocessing step that discards unnecessary elements of the data, to reduce processing time and focus the modeling work on the pieces that matter. The cleanup process includes removing hypertext markup language (HTML) entities, converting all characters to lowercase, and removing hyperlinks, punctuation, whitespace, and special characters that appear in sentences, e.g., with the pattern "([0-9]+)|(#)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)". It also includes expanding English abbreviations, e.g., don't = do not, I'm = I am, or OMG = Oh My God. Data cleaning is a necessary step that prepares the data for the next preprocessing steps [10]–[13].
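The cleaning step can be sketched with the Python standard library. This is a minimal sketch under assumptions: the regular expression is the pattern quoted above with the spaces introduced by typesetting removed, and the `ABBREVIATIONS` table and the function name `clean_tweet` are hypothetical stand-ins for whatever abbreviation list the authors used.

```python
import html
import re

# Pattern from the text: digits, '#', @mentions, any character that is not
# alphanumeric/space/tab, and URLs of the form scheme://...
NOISE = re.compile(r"([0-9]+)|(#)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")

# Hypothetical abbreviation table, limited to the examples given in the text.
ABBREVIATIONS = {"don't": "do not", "i'm": "i am", "omg": "oh my god"}


def clean_tweet(text: str) -> str:
    """Unescape HTML entities, lowercase, expand abbreviations, strip noise."""
    text = html.unescape(text).lower()
    # Expand abbreviations first, while apostrophes are still present.
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    text = NOISE.sub(" ", " ".join(words))
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())
```

Expanding abbreviations before applying the pattern is a design choice of this sketch: once the apostrophe is stripped, "don't" can no longer be recognized.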

3.3.2. Tokenization

Tokenization is the procedure of splitting raw textual data into separate tokens, each of which is eventually a word or a character. The purpose of tokenization is to identify the words in a sentence or phrase [14], [15]. There are two types of tokenization: i) sentence tokenization: the process of dividing a textual document into a group of sentences. The objective of this operation is to convert the text into meaningful sentences. It works by finding the separation marks between two sentences, such as a period (.) or a newline character (\n) [10], [11]; ii) word tokenization: the process of separating a sentence into the words that form it, called tokens. It works by finding the separators between words, such as a dot (.), a comma (,), or whitespace [12]. Each word is then converted to lowercase.
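Both kinds of tokenization can be illustrated with regular expressions. The sketch below is a simplified stand-in, not the tokenizer used in the paper (which is not specified here); the function names and the exact separator rules are assumptions.

```python
import re


def sentence_tokenize(text: str) -> list[str]:
    """Split text into sentences at '.', '!' or '?' followed by whitespace, or at newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def word_tokenize(sentence: str) -> list[str]:
    """Split a sentence into lowercase word tokens, dropping punctuation and whitespace."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9']+", sentence)]
```

A rule-based splitter like this mishandles abbreviations such as "e.g."; trained tokenizers handle such cases better.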

3.3.3. Remove stop-words

Stop words are words of natural human language that carry little or no meaning on their own. They are common, frequently used features that appear in nearly all textual documents. Common examples are conjunctions such as and, or, and but, and pronouns such as he and she; they should be removed because they have little or no effect on the text mining process. Because stop words appear so frequently, they obscure the content of the text files that contain them. Removing stop words from textual documents eliminates the less important words and decreases the amount of textual data to be processed, which improves system performance [16], [17].
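Stop-word removal amounts to a set-membership filter. In this minimal sketch the `STOP_WORDS` set is a deliberately tiny, hypothetical list built from the examples in the text; a real system would load a full list from an NLP library.

```python
# Hypothetical, deliberately small stop list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "but", "he", "she", "is", "of", "or", "the", "to"})


def remove_stop_words(tokens):
    """Return the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```

Using a `frozenset` keeps each lookup constant-time, which matters when the filter runs over every token of every tweet.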

3.3.4. Lemmatization

Lemmatization is an important preprocessing step for several text mining applications. It is used in natural language processing and is a useful tool for sentiment analysis and classification. Lemmatization converts each word to its basic form, the lemma. For instance, "good," "better," and "best" are converted to the root word "good". Combined with part of speech (POS) tagging, it returns a valid dictionary form of each word. The goal of lemmatization is to reduce the inflectional forms and derivations of each word to their common root [18].

3.3.5. Part of speech tagging

POS tagging assigns a part of speech to each word in a phrase. POS knowledge plays an essential role because the same word can have a distinct meaning depending on its POS tag and on the sentence context; the essential parts of speech include verb, noun, adverb, and adjective. POS tags can be used to differentiate words and identify their parts of speech. The lemmatization process relies mainly on POS tagging [19].

3.4. Feature extraction

Feature extraction is an important stage of the work. The process of reducing the inputs so that large volumes of data can be analyzed, processed, or managed is called feature selection [20]. Several features are extracted from the dataset, and they must be in a specific format that can be passed directly to the classification algorithms. This paper used the TF-IDF method.

Its main application is to determine the significance of a particular word for a document in a specific dataset. Each word in the document is assigned a weight:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D) \qquad (1)$$

$$\mathrm{IDF}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|} \qquad (2)$$

where TF(t, d) is the frequency of the term t in a specific document d, N is the total number of documents in the dataset D, and IDF(t, D) is the logarithm of N divided by the number of documents in D that contain the term t [16], [20], [21].
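The weighting can be computed directly from token lists. The sketch below assumes that TF is the raw count of a term in a document and that IDF is the natural logarithm of N over the document frequency; the function name `tf_idf` is hypothetical, and the paper does not state which TF-IDF variant or library was used.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document, with weight = TF * log(N / DF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency within this document
        weights.append(
            {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        )
    return weights
```

A term that occurs in every document receives weight 0, which is how the scheme discounts uninformative words. Library implementations such as scikit-learn's `TfidfVectorizer` apply a smoothed IDF by default, so their values differ from this unsmoothed sketch.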

3.5. Tweets categorization and splitting dataset

In dataset A, each tweet is labeled with one of four categories: sadness, joy, anger, or fear. Labels are an essential part of the data in machine learning procedures, since the model is trained on historical data with predefined target attributes (values). The labeling of the dataset must be done carefully, because labeling errors cause inaccuracies that affect the quality of the dataset and the performance of the model.

Once the tweets are labeled (sad, joy, anger, or fear), the entire dataset is split into two parts, 70% for training and 30% for testing. The same procedure is applied to dataset B, which has only two categories (negative and positive). Several members of the Naïve Bayes family were used, and each is explained in the following subsections.
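The 70/30 split can be sketched as follows. The shuffle and the fixed seed are assumptions made for reproducibility; the paper does not say whether the data were shuffled, and the function name `split_dataset` is hypothetical.

```python
import random


def split_dataset(samples, labels, test_ratio=0.3, seed=42):
    """Shuffle indices and split samples/labels into train and test parts.

    Returns (train_samples, test_samples, train_labels, test_labels).
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(len(samples) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [samples[i] for i in train_idx],
        [samples[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )
```

With the 3,057 tweets of dataset A, this rounding yields 917 test tweets and 2,140 training tweets.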

3.5.1. Naïve Bayes

The sentiment analyzer is built using the Naïve Bayes model as a classifier. This model learns from the labels of the training part and then performs the sentiment classification. It assumes that the presence of a particular feature within a class is independent of the presence of the other features within the same class. Bayes' theorem determines the probability of a specific event based on the probability distributions of other related events [22]-[24].

In this study, the training set containing the labeled tweets was given as input to each model, so that it could learn the characteristics associated with the emotion label of each sentence.

Bayes' theorem is expressed as:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (3)$$

where:

- P(H|X) denotes the posterior probability of hypothesis H given that the evidence X is observed.

- P(X|H) denotes the probability of observing the evidence X when hypothesis H holds.

- P(H) denotes the prior probability of H, irrespective of any evidence.

- P(X) denotes the prior probability of the evidence X, irrespective of the hypothesis.

To apply Bayes' theorem here, the sentiment plays the role of the hypothesis H and the aspects/features play the role of the evidence X. A sentence consists of several words, and in practice it is not easy to work out which part of a tweet should be nominated as an aspect/feature. Thus, it is reasonable to treat every word as an aspect/feature when applying Bayes' theorem:

$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)} \qquad (4)$$

Where:

A is a word or a feature.

C is a sentiment value or category.

Because many word features can support one category, e.g. features A1, A2, and A3, Bayes' theorem can be extended to:

$$P(C \mid A_1, A_2, A_3) = \frac{P(A_1, A_2, A_3 \mid C)\,P(C)}{P(A_1, A_2, A_3)} \qquad (5)$$

Because the Naïve Bayes model assumes that the pieces of evidence (here, words or features) are independent of each other, the formula becomes:

$$P(C \mid A_1, A_2, A_3) = \frac{P(A_1 \mid C)\,P(A_2 \mid C)\,P(A_3 \mid C)\,P(C)}{P(A_1)\,P(A_2)\,P(A_3)} \qquad (6)$$

In general, for n features $A = (A_1, \ldots, A_n)$ with $P(A) = \prod_{i=1}^{n} P(A_i)$, this can be written as:

$$P(C \mid A) = \frac{P(C)\prod_{i=1}^{n} P(A_i \mid C)}{P(A)} \qquad (7)$$

Because P(A) is fixed for every sentiment value, the value of P(C|A) is determined once $P(C)\prod_{i=1}^{n} P(A_i \mid C)$ is determined.
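The product rule described in this subsection, which scores each class by its prior times the per-word likelihoods, can be worked through numerically. The sketch below uses hypothetical priors and likelihoods (they are not taken from the paper's data) and computes the product in log space to avoid underflow on long tweets. Normalizing over the classes takes the place of dividing by P(A), which is the same for every class.

```python
import math


def naive_bayes_posterior(words, priors, likelihoods):
    """Return {class: posterior} for a list of words.

    priors: {class: P(C)}; likelihoods: {class: {word: P(word | C)}}.
    All probabilities are assumed to be strictly positive.
    """
    # log P(C) + sum_i log P(A_i | C) for each class
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)
        for c in priors
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# Hypothetical numbers for illustration only.
priors = {"joy": 0.5, "fear": 0.5}
likelihoods = {
    "joy": {"happy": 0.4, "virus": 0.1},
    "fear": {"happy": 0.1, "virus": 0.4},
}
posterior = naive_bayes_posterior(["happy", "happy", "virus"], priors, likelihoods)
```

Here the joy score is 0.5 × 0.4 × 0.4 × 0.1 = 0.008 and the fear score is 0.5 × 0.1 × 0.1 × 0.4 = 0.002, so the normalized posterior for joy is 0.008 / 0.010 = 0.8.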

3.5.2. Multinomial Naïve Bayes

Multinomial Naïve Bayes is, like Naïve Bayes, a probabilistic technique. It extends the Naïve Bayes algorithm to data that follow a multinomial distribution, and it is a frequency-dependent model. The Multinomial Naïve Bayes algorithm works with term frequency, i.e., the number of times a term is repeated. The main difference between the Naïve Bayes and Multinomial Naïve Bayes classifiers is that Naïve Bayes operates on conditional probability (conditional independence of the characteristics is assumed), whereas Multinomial Naïve Bayes operates on the multinomial distribution. In other words, Multinomial NB can be regarded as an updated version of the NB algorithm that makes effective use of the frequency of each item [24]-[27]. The Multinomial Naïve Bayes estimate can be illustrated by the following equation [28]-[33]:

$$P(w \mid c) = \frac{\mathrm{count}(w, c)}{\sum_{w' \in V} \mathrm{count}(w', c)} \qquad (8)$$

Where:

- $P(w \mid c)$: is the conditional probability of the word $w$ appearing in a document of class c.

- $\mathrm{count}(w, c)$: is the number of occurrences of the word $w$ in the documents of class c.

- $\sum_{w' \in V} \mathrm{count}(w', c)$: is the total number of occurrences of all words in class c, where V is the vocabulary.
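The relative-frequency estimate described above (occurrences of a word in a class divided by the total occurrences of all words in that class) can be sketched in plain Python. The `alpha` parameter is an addition of this sketch, not something stated in the paper: `alpha=0` gives the plain relative frequency, while `alpha=1` gives the common add-one (Laplace) smoothing that prevents zero probabilities for unseen words. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def multinomial_likelihoods(docs, labels, alpha=0.0):
    """Estimate P(word | class) from token lists.

    Assumes every class has at least one token when alpha == 0.
    """
    counts = defaultdict(Counter)  # class -> word -> number of occurrences
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    vocab = {w for word_counts in counts.values() for w in word_counts}
    probs = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values()) + alpha * len(vocab)
        probs[label] = {w: (word_counts[w] + alpha) / total for w in vocab}
    return probs
```

Because the estimate uses occurrence counts, a word repeated three times in a tweet contributes three times as much as a word used once, which is the frequency dependence the text refers to.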

3.5.3. Bernoulli Naïve Bayes

Bernoulli Naïve Bayes is a classifier that works on binary features, i.e., whether an item appears or not. Unlike Multinomial NB, Bernoulli NB does not take the frequency of a term into account: where the multinomial approach considers term frequencies, the Bernoulli NB approach only determines the presence of a term in the text under consideration. In the multivariate Bernoulli Naïve Bayes algorithm, features are distinct binary variables that describe the presence or absence of each term in the document under consideration [24], [30]. Algorithm 1 depicts the implementation steps of the proposed system. The Bernoulli Naïve Bayes estimate can be illustrated by (9).

$$P(t \mid c) = \frac{df_{t,c}}{N_c} \qquad (9)$$

Where:

- $t$: represents a term in the document.

- $df$: represents the number of documents in which a term appears.

- $P(t \mid c)$: represents the conditional probability of the term $t$ appearing in a document of class c.

- $df_{t,c}$: represents the number of documents of class c in which the term $t$ appears.

- $N_c$: represents the total number of documents of class c.
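The presence-based estimate described above (documents of a class that contain a term, divided by all documents of that class) can be sketched the same way. This is an unsmoothed illustration with a hypothetical function name; practical implementations usually add smoothing so that no probability is exactly 0 or 1.

```python
from collections import Counter, defaultdict


def bernoulli_likelihoods(docs, labels):
    """Estimate P(term present | class) as a fraction of documents."""
    doc_freq = defaultdict(Counter)  # class -> term -> documents containing it
    n_docs = Counter(labels)         # class -> number of documents
    for tokens, label in zip(docs, labels):
        # set() ignores repetitions: only presence or absence is counted.
        doc_freq[label].update(set(tokens))
    vocab = {t for term_counts in doc_freq.values() for t in term_counts}
    return {
        label: {t: doc_freq[label][t] / n_docs[label] for t in vocab}
        for label in n_docs
    }
```

Note the contrast with the multinomial estimate: a tweet that repeats "virus" twice counts once here.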

Algorithm 1: The implementation steps of the system

Input: Tweets dataset
Output: The best prediction method

Begin
Step 1: for each tweet in the dataset // the same steps apply to whichever dataset is entered (first or second)
    call Preprocessing function
        • Data cleaning
        • Abbreviation processing
        • Remove numbers and other marks
        • Delete website links
end for
Step 2: for each tweet in the dataset
    call Tokenization function // tokenize each tweet or comment into single words and convert them to lowercase
    call Remove stop words & punctuation function
    call Word lemmatization function
    call Part of speech tagging function
end for
Step 3: Extract features using TF-IDF. // see subsection 3.4
Step 4: Split the data into training and testing sets. // 70% training and 30% testing
Step 5: Switch (x) // after training the models, each model is called to check its accuracy
    Step 5.1: Case (1): Call NB model for classification; Break;
    Step 5.2: Case (2): Call Bernoulli NB model for classification; Break;
    Step 5.3: Case (3): Call Multinomial NB model for classification; Break;
End Switch
Step 6: Compare the accuracy of the models and choose the best method.
End
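The final comparison step of Algorithm 1 can be sketched as follows. Accuracy here is the fraction of test tweets whose predicted label matches the true label; the function names and the toy labels and predictions in the usage lines are hypothetical and are not the paper's results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_model(predictions, y_true):
    """Return (name of the most accurate model, {model name: accuracy})."""
    scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
    return max(scores, key=scores.get), scores


# Toy example with made-up predictions from three models.
y_true = ["joy", "fear", "anger", "sad"]
predictions = {
    "NB": ["joy", "fear", "sad", "sad"],
    "Bernoulli NB": ["joy", "joy", "joy", "sad"],
    "Multinomial NB": ["joy", "fear", "anger", "sad"],
}
winner, scores = best_model(predictions, y_true)
```

Accuracy is a reasonable summary when the classes are roughly balanced, as they are in both datasets described earlier.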

4. RESULT AND DISCUSSION

The same steps were applied to the first dataset (dataset A, subsection 3.1) and the second dataset (dataset B, subsection 3.2). The difference is that the first dataset is classified into anger, fear, sadness, or joy depending on the content of the tweet, while the second is classified into negative and positive tweets depending on the characteristics of the tweet that reflect the orientation of the user (writer). The cleaned dataset is the input to the classification and sentiment analysis models. The results for each model are given in Tables 3 and 4, which show the accuracy each model achieved on each dataset.

Table 3 shows the accuracy values of each model on the first dataset. As mentioned earlier in this paper, the Multinomial Naïve-Bayes model achieved the highest accuracy (91.6%). It should be noted that after the pre-processing steps are applied to the dataset, the remaining words indicate the direction of the tweet, i.e., which of the four categories it belongs to, and the category is predicted from them. Table 4 presents the accuracy results of the models on the second dataset. As in Table 3, the Multinomial Naïve-Bayes model outperformed the rest of the models, with the highest accuracy (87.6%).

Multinomial Naïve-Bayes is a frequency-dependent model, and tweets, as text data, contain repeated elements (words). After pre-processing and feature extraction, these features recur to a certain extent, which enables the Multinomial Naïve-Bayes model to classify tweets more accurately than the other models. Compared with related works that studied similar problems, the accuracy achieved by this system is higher than that of some similar works (in terms of the type of dataset and the feature extraction method), as shown in Table 5.

Table 3. Accuracy value for each model of dataset A

| Model | Accuracy value |
|---|---|
| Naïve-Bayes | 83.5% |
| Bernoulli Naïve-Bayes | 83.4% |
| Multinomial Naïve-Bayes | 91.6% |

Table 4. Accuracy value for each model of dataset B

| Model | Accuracy value |
|---|---|
| Naïve-Bayes | 82.4% |
| Bernoulli Naïve-Bayes | 85.9% |
| Multinomial Naïve-Bayes | 87.6% |

Table 5. Summary of related work

| Ref | Title and publishing year | Method used | Dataset source | Feature extraction | Highest accuracy |
|---|---|---|---|---|---|
| [6] | Framing Twitter public sentiment on Nigerian government COVID-19 palliatives distribution using machine learning, 2021 | Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), K-Nearest Neighbor (KNN), and Decision Tree (DT) | Nigerian Local English Slang-Pidgin (NLES-P) | TF-IDF | Support Vector Machine accuracy was 88% |
| [7] | Sentiment analysis on COVID-19-related social distancing in Canada using Twitter data, 2021 | Support Vector Machine | Twitter | TF-IDF | SVM accuracy was 87% |
| [8] | Twitter sentiment analysis towards COVID-19 vaccines in the Philippines using naïve Bayes, 2021 | Naïve Bayes | Twitter | TF-IDF | Naïve Bayes algorithm accuracy was 81.77% |
| [9] | Sentiment analysis of the Covid-19 virus infection in Indonesian public transportation on Twitter data: a case study of commuter line passengers, 2020 | Naïve Bayes and Decision Tree | Twitter | (unknown) | Naïve Bayes algorithm accuracy was 73.59% |

Based on the results presented in Tables 3 and 4 and Figure 2, Multinomial NB can be adopted as the most efficient algorithm for a problem of this kind, and the other algorithms can be dispensed with.

Figure 2. Accuracy value for each model and both datasets

5. CONCLUSION

The objective of this study is to analyze and classify the data of social media users and to identify their attitudes, sentiments, and interaction with an event. The study concluded that the features used, which represent the direction of the tweet, are usually repeated, so the Multinomial NB model achieved a higher classification accuracy than the other machine learning models used (91.6% on the first dataset and 87.6% on the second dataset). Despite the different topics of the tweets in the two datasets, the system achieved good accuracy on both, as the accuracy tables show. The observations of this study suggest that the classification accuracy of the approach could be increased either by using more powerful models or by using other feature extraction methods, depending on the type of data, how free of noise they are, and the type of problem the approach is intended to solve. The researchers plan similar future work on predicting users' opinions about a particular product or issue before it is released, and also plan to link this approach directly to Twitter through the Tweepy library in the Python programming language, so that tweets can be analyzed interactively.


BIOGRAPHIES OF AUTHORS

Murtadha B. Ressan is a senior programmer in the Iraqi Federal Ministry of Construction, Housing and Municipalities and a member of the Committee for the Development and Modernization of the Governmental Human Resources System. His research interests focus on developing and modernizing the financial and administrative systems of government agencies. He has published many research papers in the local journal of the University of Technology. He holds a master's degree in computer science from the University of Technology. He can be contacted at email: cs.19.02@grad.uotechnology.edu.iq.

Rehab F. Hassan is an Assistant Professor in the Computer Science Department at the University of Technology, Iraq. Her research interests are in the areas of wireless sensor networks, mobile computing, information technology, intelligent environments/IoT, and GIS. She has published more than 50 conference and journal papers. She obtained a PhD, an MSc, and a BSc, all in computer science, from the University of Technology. She can be contacted at email: Rehabf.hassan@uotechnology.edu.iq.

Factor analysis influencing Mobile JKN user experience using sentiment analysis

Article

Full-text available

Jun 2024

span lang="EN-US">Social security administration for health or Badan Penyelenggara Jaminan Sosial Kesehatan (BPJS Kesehatan), as a public legal entity, has a critical role in the health of the Indonesian population. BPJS Kesehatan introduced the Mobile national health insurance or jaminan kesehatan nasional (JKN) application to enhance its services, enabling Indonesians to access it directly. Nevertheless, the rating of the Mobile JKN application on the Google Play Store has shown a gradual decline over time. Therefore, this study was conducted to analyze the factors influencing the user experience of the Mobile JKN application, utilizing the review data obtained from the Google Play Store. Sentiment analysis using the Naïve Bayes (NB) classification model and support vector machine (SVM) combined with synthetic minority oversampling technique (SMOTE) and slang word replacement. The results obtained an accuracy value of 93.33%, precision of 93.76%, recall of 93.33%, and F1-score of 93.43%. A further analysis was conducted using online service quality factors to obtain the main factors influencing the experience of Mobile JKN application users. The evaluation findings revealed that factors of security, ease of use, and timeliness are three fundamental aspects that should be given immediate attention by BPJS Kesehatan while improving the Mobile JKN application in the future.</span

Evaluation of Different Stemming Techniques on Arabic Customer Reviews

Article

Full-text available

Feb 2024

Customer opinion and reviews play a vital role in marketing expansion. Big companies all over the world assign a lot of their efforts to analyzing customers’ feedback to keep track of their needs. Natural Language Processing (NLP) is widely used to analyze such review texts. Arabic customer analysis and classification also began to gain researchers’ attention due to the wide range of Arabic language speakers. Working with Arabic Language is a very challenging task because of the orthographic nature of Arabic. Also, customers often write their reviews in their dialectical style, which often diverts from standard Arabic. This study presents a method to classify Arabic customer reviews using four classifiers (K-nearest Neighbor (KNN), Support Vector Machine (SVM), Logistic Regression (RL), and Naïve Bayes (NB)). The classification is implemented with three stemming techniques (Snowball, Khoja, and Tashaphyne). The HARD dataset is adopted to perform the experiments. The results stated that the stemming methods can enhance classification performance despite the complexity of Arabic scripts and dialects. This work sheds light on utilizing and investigating more machine learning (ML) techniques and evaluating the results.

Implementation of Naïve Bayes Algorithm in Sentiment Analysis of Twitter Social Media Users Regarding Their Interest to Pay the Tax

Article

Full-text available

Nov 2023

Since 2008, tax revenue has failed to reach the target set in the State Budget each year. Until 2021, tax revenue managed to reach the target that had been targeted in the 2021 state budget. In the midst of improving tax revenue, towards the end of February 2023, a case involving the son of a Directorate General of Taxes (DGT) that made the father called by the Corruption Eradication Commission (CEC) to be asked for an explanation of his assets. After the case, there were many calls in the community to stop paying taxes, which was assessed by Tauhid Ahmad as Executive Director of Indef as a form of decreased trust in tax collecting institutions. This can affect the amount of revenue from taxes because trust in the government is one of the factors that tend to affect public compliance in paying taxes. Which can affect the amount of revenue from taxes because trust in the government is one of the factors that tend to affect public compliance in paying taxes. One of the crowded calls is the pros and cons of the tax boycott movement on Twitter. With the pros and cWith the pros and cons of the movement that can affect tax revenues on Twitter social media, an assessment based on sentiment analysis is needed which is divided into positive, neutral, or negative categories. Sentiment analysis in this research is carried out using three variations of Naïve Bayes assisted by the TF-IDF word weighting model, namely Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes. Then Confussion Matrix is used to evaluate the model by obtaining the accuracy, precission, recall, and f1-score values and the use of Synthetic Minority Oversampling Technique (SMOTE) to handle unbalanced data. 
The results of this study on unbalanced data, the implementation of Bernoulli Naïve Bayes using the SMOTE technique on a dataset comparison of 80:20 resulted in better performance than the variations of Gaussian and Multinomial Naïve Bayes with accuracy results of 91.03%, precision, 71.11%, recall 71.43%, and f1-score of 71.18%.

Atrial Fibrillation Detection Through ML Approach: A Comparitive Study

Article

Oct 2023

Rosy Madaan

Atrial Fibrillation (AF) is a common “cardiac arrhythmia” with significant health implications. Traditional AF detection methods have limitations in continuous monitoring and data analysis. The emergence of machine learning (ML) offers promising solutions for accurate and timely AF detection. This study aims to explore and evaluate various ML techniques for AF detection, considering data quality, clinical validation, and algorithm performance. A diverse dataset of ECG signals and patient information is collected and pre-processed for training and testing ML models. The study implements supervised and unsupervised learning algorithms, deep learning (DL) architectures, and ensemble methods to compare their effectiveness in AF detection. Results demonstrate the potential of ML-based AF detection to revolutionize diagnosis and management, leading to improved patient care and healthcare outcomes in cardiology. The results of our comparative study demonstrate that all ML approaches achieved impressive results in detecting AF from ECG signals. “The logistic regression classifier achieved an accuracy of 92.48% and sensitivity of 91.89%. The Naïve Bayes classifier achieved an accuracy of 90.26% and sensitivity of 89.27%. The SVM classifier achieved an accuracy of 93.87% and sensitivity of 92.43%. The Decision tree achieved an accuracy of 93.87% and sensitivity of 90.63%. Finally, the Random Forest model attained an accuracy of 95.8% and sensitivity of 92.88%”.

Sentiment Analysis of IDAHOBIT Celebrations using Naïve Bayes and Decision Tree Algorithms

Article

Full-text available

Mar 2023

The development of LGBTIQ in Indonesia reflects the shift in culture and the emergence of this phenomenon has attracted the attention of the Indonesian people. The use of NLP, ML, and statistics technology in tweet analysis can be used to identify sentiments contained in tweets. This study compares Naïve Bayes algorithm and Decision Tree in sentiment analysis classification, in which the multilingual sentiment analysis method is used in the labeling process of training data. Naïve Bayes results give the best classification with 100% accuracy, precision, and recall, and the number of positive sentiments is 385, negative sentiments are 3117, and neutral sentiments are 899. It looks that the negative class is the most superior compared to other classes. This proves that the Indonesian people have an unfavorable response to the IDAHOBIT celebration.

Development and Comparison of Multiple Emotion Classification Models in Indonesia Text Using Machine Learning

Article

Jan 2024

Analisis Sentimen Film Dirty Vote Menggunakan BERT (Bidirectional Encoder Representations from Transformers)

Article

Full-text available

Apr 2024

Penelitian ini bertujuan untuk melakukan analisis sentimen terhadap ulasan film "Dirty Vote" dari berbagai sumber, seperti media sosial, situs web ulasan film, dan forum online, dengan menggunakan model BERT yang telah di-fine-tuning. Pendekatan ini melibatkan pengumpulan data ulasan, pre-processing data, fine-tuning model BERT, dan evaluasi kinerja model. Hasil penelitian menunjukkan bahwa model BERT mencapai tingkat kinerja yang tinggi dengan akurasi, presisi, recall, dan F1-score yang melebihi ambang batas 0.8 pada dataset validasi. Analisis sentimen dari berbagai sumber mengungkapkan variasi dalam opini publik terhadap film "Dirty Vote", dengan perbedaan yang signifikan dalam sentimen yang diekspresikan melalui media sosial seperti Twitter dan Facebook dibandingkan dengan ulasan dari situs web khusus atau forum online. Selain itu, diskusi temuan analisis sentimen mengungkapkan preferensi masyarakat terhadap aspek-aspek tertentu dari film, seperti efek visual dan musik. Temuan analisis sentimen mengungkapkan bahwa efek visual dan musik mendapat penilaian tertinggi dari masyarakat, sementara pemeran dan sutradara mendapat penilaian yang lebih rendah. Informasi ini dapat digunakan oleh para pembuat film untuk memperbaiki aspek-aspek yang kurang memuaskan dalam produksi film selanjutnya.

Garbage Classification Using Inception V3 as Image Embedding and Extreme Gradient Boosting

Conference Paper

Jan 2024

Aspect-Based Sentiment Classification of Online Product Reviews Using Hybrid Lexicon-Machine Learning Approach

Chapter

Mar 2024

Nowadays, a good number of customers express their experience with online products. These reviews have an important role in customers’ purchase decision process. There may be hundreds or thousands of unstructured and heterogeneous reviews for a popular product. Traditional text processing techniques have limited capability in extracting opinions on customers’ product reviews from huge data over the Internet. Although lexical approaches aim to map words to sentiments by building a lexicon, the process of developing a lexicon with sentiment scores for phrases and sentences becomes tedious and time consuming as data volume increases. Currently, text sentiment analysis requires fast and accurate techniques to decode and quantify the emotion in tweets. This paper presents a hybrid framework based on lexicon and machine learning (ML) algorithms to train previously seen tweets in order to predict the sentiments of some new input tweets into positive, negative, and neutral polarities. Tweepy library was used to extract tweets on Laptop reviews to identify some aspects and classify sentiments towards them into specific polarity. After data pre-processing, the implementation in Python used Natural Language Processing package called TextBlob to assign subjectivity and polarity scores to text. The scores were used by the ML algorithms to analyze and classify sentiments. A dataset of 2226 tweets was used for training and testing Support Vector Machine, Random Forest, and Naïve Bayes classifiers. Results indicate that Random Forest classifier outperforms others in the task of classifying sentiments on Laptop reviews with the highest accuracy (96%), precision (97%), and F1-score (96%).

MACHINE LEARNING-BASED RICE GRAIN CLASSIFICATION THROUGH NUMERICAL FEATURE EXTRACTION FROM RICE IMAGE DATA

Conference Paper

Full-text available

Mar 2023

Rice is a staple food for over half of the world, making it a crucial crop on a global scale. It is a major source of food and income for millions of people and is widely cultivated in many countries. The diversity of rice species and its many uses make it an important crop, both economically and culturally. Accurate classification of rice grains is important in various stages of the rice industry, including quality control, grain sorting, and species identification. Grain morphology plays a vital role in the classification of rice, and classifying rice using traditional methods, which rely on morphological features like grain length, width, weight, and shape, are subject to human error and can be time-consuming and labour-intensive with subjective results. The growing need for accurate and efficient rice species classification has led to the development of machine learning models, which can process large amounts of data and provide accurate results in real-time. In recent years, machine learning models have shown promising results in the classification of different rice species. In this study, we evaluated the performance of several machine learning models, including Support Vector Machines, k-Nearest Neighbors, Stochastic Gradient Descent, Naïve Bayes and Random Forest for classifying different rice species based on numerical features extracted from images of rice grains. The rice (Cammeo and Osmancik) dataset comprises 3810 numerical data, separated into 2180 instances of the Osmancik species and 1630 instances of the Cammeo species. Seven morphological features were identified, namely, the area, perimeter, major axis length, minor axis length, extent, convex area, and eccentricity for each grain of rice. The results show that the Naïve Bayes model had the best performance with the area under the curve of 0.969, and the Stochastic Gradient Descent model achieved the highest performance with a cumulative accuracy of 92.8%.

Advanced Feature-Selection-Based Hybrid Ensemble Learning Algorithms for Network Intrusion Detection Systems

Article

Full-text available

Jul 2022

As cyber-attacks become remarkably sophisticated, effective Intrusion Detection Systems (IDSs) are needed to monitor computer resources and to provide alerts regarding unusual or suspicious behavior. Despite using several machine learning (ML) and data mining methods to achieve high effectiveness, these systems have not proven ideal. Current intrusion detection algorithms suffer from high dimensionality, redundancy, meaningless data, high error rate, false alarm rate, and false-negative rate. This paper proposes a novel Ensemble Learning (EL) algorithm-based network IDS model. The efficient feature selection is attained via a hybrid of Correlation Feature Selection coupled with Forest Panelized Attributes (CFS–FPA). The improved intrusion detection involves exploiting AdaBoosting and bagging ensemble learning algorithms to modify four classifiers: Support Vector Machine, Random Forest, Naïve Bayes, and K-Nearest Neighbor. These four enhanced classifiers have been applied first as AdaBoosting and then as bagging, using the aggregation technique through the voting average technique. To provide better benchmarking, both binary and multi-class classification forms are used to evaluate the model. The experimental results of applying the model to CICIDS2017 dataset achieved promising results of 99.7%accuracy, a 0.053 false-negative rate, and a 0.004 false alarm rate. This system will be effective for information technology-based organizations, as it is expected to provide a high level of symmetry between information security and detection of attacks and malicious intrusion.

Forecasting epidemic diseases with Arabic Twitter data and WHO reports using machine learning techniques

Article

Full-text available

Apr 2022

Twitter is one of the essential social media tools used by many people because they express their views, daily problems, and what they suffer from the health aspects. On Twitter, we can detect and track the spread of the most serious diseases like flu; by analyzing people's tweets and collecting reports from health organizations. In this paper, the data from Twitter was collected in the Arabic language related to the spread of influenza using many Arabic keywords. Then, we applied several machine learning algorithms, which are random forest, multinomial naïve bayes, decision tree, and voting classifier. We also found the correlation between the collected tweets and the reports collected from the World Health Organization (WHO) website according to three experiments. These experiments are: i) between the tweets and reports based on the 13 countries regardless of the time, ii) between the tweets and reports based on the Arab regions that depend on these countries' dialects irrespective of the time, iii) between all tweets and all reports based on the week number. The results from these experiments show that there is a strong correlation between the tweets and the reports, which means that the tweets and the WHO reports can together detect the flu outbreaks in the Arab world.

Proposed Hybrid Correlation Feature Selection Forest Panalized Attribute Approach to advance IDSs

Article

Full-text available

Dec 2021

NetworkIntrusionDetectionSystem(NIDS), widely used network infrastructure. Although many datamining has been used to increase the effectiveness of IDSs, current ID still struggle to perform well. therfore; proposed a new NIDS focused on feature_selection. The proposed CorrelationFeatureSelection_ForestPanalizedAttributes(CFS_FPA) used for dimensionality_reduction and selects the optimal_subset. based on two steps: first check each feature with a target(class) and choose only features that most effective by applying CFS filter using a statistical_method, then applied FPA to select only features will enhance ID and reduce_dimensionality. proposal tested with the NSLKDD experimental results of accuracy 0.997% and 0.004 FAR, wherein UNSWNB15_dataset accuracy and FAR are 0.995%, 0.008 consequently.

Sentiment analysis on twitter tweets about COVID-19 vaccines using NLP and supervised KNN classification algorithm

Article

Full-text available

Jul 2021

The pandemic has taken the world by storm. Almost the entire world went into lockdown to save the people from the deadly COVID-19. Scientists around the around have come up with several vaccines for the virus. Amongthem, Pfizer, Moderna, and AstraZeneca have become quite famous. General people however have been expressing their feelings about the safety and effectiveness of the vaccines on social media like Twitter. In this study, such tweets are being extracted from Twitter using a Twitter API authentication token. The raw tweets are stored and processed using NLP. The processed data is then classified using a supervised KNN classification algorithm. The algorithm classifies the data into three classes, positive, negative, and neutral. These classes refer to the sentiment of the general people whose Tweets are extracted for analysis. From the analysis it is seen that Pfizer shows 47.29%positive, 37.5% negative and 15.21% neutral, Moderna shows 46.16%positive, 40.71% negative, and 13.13% neutral, AstraZeneca shows 40.08%positive, 40.06% negative and 13.86% neutral sentiment.

Data mining technique to analyse and predict crime using crime categories and arrest records

Article

Full-text available

May 2021

Generally, crimes influence organisations as it starts occurring frequently in society. Because of having many dimensions of crime data, it is difficult to mine the available information using off the shelf or statistical data analysis tools. Improving this process will aid the police as well as crime protection agencies to solve the crime rate in a faster period. Also, criminals can often be identified based on crime data. Data mining includes strategies at the convergence of machine learning and database frameworks. Using this concept, we can extract previously unknown useful information and their patterns of occurrence from unstructured data. The sole purpose of this paper is to give an idea of how data mining can be utilised by crime investigation agencies to discover relevant precautionary measures from prediction rates. Data sets are analysed by some supervised classification algorithms, namely decision tree, K-nearest neighbours (KNN) and random forest algorithms. Crime forecasting is done for frequently occurring crimes like robbery, assault, theft, etc. Specifically, the results indicate the superiority of the random forest algorithm in test accuracy.

Sentiment Analysis on COVID-19-Related Social Distancing in Canada Using Twitter Data

Article

Full-text available

Jun 2021
Int J Environ Res Publ Health

Background: COVID-19 preventive measures have been an obstacle to millions of people around the world, influencing not only their normal day-to-day activities but also affecting their mental health. Social distancing is one such preventive measure. People express their opinions freely through social media platforms like Twitter, which can be shared among other users. The articulated texts from Twitter can be analyzed to find the sentiments of the public concerning social distancing. Objective: To understand and analyze public sentiments towards social distancing as articulated in Twitter textual data. Methods: Twitter data specific to Canada and texts comprising social distancing keywords were extrapolated, followed by utilizing the SentiStrength tool to extricate sentiment polarity of tweet texts. Thereafter, the support vector machine (SVM) algorithm was employed for sentiment classification. Evaluation of performance was measured with a confusion matrix, precision, recall, and F1 measure. Results: This study resulted in the extraction of a total of 629 tweet texts, of which, 40% of tweets exhibited neutral sentiments, followed by 35% of tweets showed negative sentiments and only 25% of tweets expressed positive sentiments towards social distancing. The SVM algorithm was applied by dissecting the dataset into 80% training and 20% testing data. Performance evaluation resulted in an accuracy of 71%. Upon using tweet texts with only positive and negative sentiment polarity, the accuracy increased to 81%. It was observed that reducing test data by 10% increased the accuracy to 87%. Conclusion: Results showed that an increase in training data increased the performance of the algorithm.

Twitter Sentiment Analysis towards COVID-19 Vaccines in the Philippines Using Naïve Bayes

Article

Full-text available

May 2021

A year into the COVID-19 pandemic and one of the longest recorded lockdowns in the world, the Philippines received its first delivery of COVID-19 vaccines on 1 March 2021 through WHO’s COVAX initiative. A month into inoculation of all frontline health professionals and other priority groups, the authors of this study gathered data on the sentiment of Filipinos regarding the Philippine government’s efforts using the social networking site Twitter. Natural language processing techniques were applied to understand the general sentiment, which can help the government in analyzing their response. The sentiments were annotated and trained using the Naïve Bayes model to classify English and Filipino language tweets into positive, neutral, and negative polarities through the RapidMiner data science software. The results yielded an 81.77% accuracy, which outweighs the accuracy of recent sentiment analysis studies using Twitter data from the Philippines.

Framing Twitter Public Sentiment on Nigerian Government COVID-19 Palliatives Distribution Using Machine Learning

Article

Full-text available

Mar 2021

Abstract: Sustainable development plays a vital role in information and communication technology. In times of pandemics such as COVID-19, vulnerable people need help to survive. This help includes the distribution of relief packages and materials by the government with the primary objective of lessening the economic and psychological effects on the citizens affected by disasters such as the COVID-19 pandemic. However, there has not been an efficient way to monitor public funds’ accountability and transparency, especially in developing countries such as Nigeria. The understanding of public emotions by the government on distributed palliatives is important as it would indicate the reach and impact of the distribution exercise. Although several studies on English emotion classification have been conducted, these studies are not portable to a wider inclusive Nigerian case. This is because Informal Nigerian English (Pidgin), which Nigerians widely speak, has quite a different vocabulary from Standard English, thus limiting the applicability of the emotion classification of Standard English machine learning models. An Informal Nigerian English (Pidgin English) emotions dataset is constructed, pre-processed, and annotated. The dataset is then used to classify five emotion classes (anger, sadness, joy, fear, and disgust) on the COVID-19 palliatives and relief aid distribution in Nigeria using standard machine learning (ML) algorithms. Six ML algorithms are used in this study, and a comparative analysis of their performance is conducted. The algorithms are Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), Random Forest (RF), Logistics Regression (LR), K-Nearest Neighbor (KNN), and Decision Tree (DT). The conducted experiments reveal that Support Vector Machine outperforms the remaining classifiers with the highest accuracy of 88%. 
The “disgust” emotion class surpassed other emotion classes, i.e., sadness, joy, fear, and anger, with the highest number of counts from the classification conducted on the constructed dataset. Additionally, the conducted correlation analysis shows a significant relationship between the emotion classes of “Joy” and “Fear”, which implies that the public is excited about the palliatives’ distribution but afraid of inequality and transparency in the distribution process due to reasons such as corruption. Conclusively, the results from this experiment clearly show that the public emotions on COVID-19 support and relief aid packages’ distribution in Nigeria were not satisfactory, considering that the negative emotions from the public outnumbered the public happiness.

A novel fusion-based deep learning model for sentiment analysis of COVID-19 tweets

Article

Jun 2021
KNOWL-BASED SYST

Undoubtedly, coronavirus (COVID-19) has caused one of the biggest challenges of all time. The ongoing COVID-19 pandemic has caused more than 150 million infected cases and one million deaths globally as of May 5, 2021. Understanding the sentiment of people expressed in their social media comments can help in monitoring, controlling, and ultimately eradicating the disease. This is a sensitive matter as the threat of infectious disease significantly affects the way people think and behave in various ways. In this study, we proposed a novel method based on the fusion of four deep learning and one classical supervised machine learning model for sentiment analysis of coronavirus-related tweets from eight countries. Also, we analyzed coronavirus-related searches using Google Trends to better understand the change in the sentiment pattern at different times and places. Our findings reveal that the coronavirus attracted the attention of people from different countries at different times in varying intensities. Also, the sentiment in their tweets is correlated to the news and events that occurred in their countries including the number of newly infected cases, number of recoveries and deaths. Moreover, common sentiment patterns can be observed in various countries during the spread of the virus. We believe that different social media platforms have a great impact on raising people’s awareness about the importance of this disease as well as promoting preventive measures among people in the community.

Political Arabic Articles Classification Based on Machine Learning and Hybrid Vector

Conference Paper

Nov 2020
