Word Embedding based News Classification by using CNN

August 2021

DOI:10.1109/ICSECS52883.2021.00117

Conference: 2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on Computational Science and Information Management (ICSECS-ICOCSIM)

Authors:

Faisal Ahmed

Premier University

Nazma Akther

Premier University

Mohammad Hasan

Adam Mickiewicz University

Word Embedding based News Classification by using CNN

Faisal Ahmed

Department of CSE

Premier University

Chattogram, Bangladesh

faisalcsecubd@gmail.com

Nazma Akther

Department of CSE

United International University (UIU)

Dhaka, Bangladesh

nazmacse2013@gmail.com

Mohammad Hasan

Department of CSE

Premier University

Chattogram, Bangladesh

mehedih256@gmail.com

Kibtia Chowdhury

Department of CSE

United International University (UIU)

Dhaka, Bangladesh

kchowdhury211056@mscse.uiu.ac.bd

Md. Saddam Hossain Mukta

Department of CSE

United International University (UIU)

Dhaka, Bangladesh

saddam@cse.uiu.ac.bd

Abstract—In this era of information technology, the number of online news portals is increasing day by day. These portals earn a good profit by advertising consumer products to their readers. However, lacking intelligence, traditional news portals cannot identify what types of news their users prefer. As a consequence, they often show irrelevant advertisements to readers and cause considerable economic loss to advertisers. If a news portal can identify what type of news a user is reading, it can provide contextual advertisements (advertisements of products related to the news) and gain more profit. Therefore, in this paper, we propose a method that integrates word embedding with a Convolutional Neural Network (CNN) to classify English news into four categories: Sports, Business, National and International. The performance of the proposed method is evaluated on our prepared dataset in terms of macro-f1 and micro-f1 scores. The experimental results show that the proposed method achieves macro-f1 and micro-f1 scores of 0.90 and 0.89, respectively, which are higher than those of all the baseline methods (at most 0.86 for both scores).

Keywords—News Classification, Word Embedding, CNN, BoW, Contextual Marketing, Machine learning

I. INTRODUCTION

With the advancement of information technology, a massive amount of information is now stored in electronic form. Since vast amounts of news are available on the Internet, finding the news of interest has become a time-consuming process. In the financial sector, news events are a critical factor that moves the financial market positively or negatively. Therefore, classification of news is a crucial step in allowing users to reach their news of interest quickly and effectively. Besides, news articles have become an important resource for company managers, policy makers and investors in making better decisions. People usually read news to acquire information about their preferred areas from all over the world. However, news classification is a challenging task in the field of text mining because of the large volume of news available in digital media. It is very hard for editors to formulate structured information from unstructured news data.

To develop such an automated news classification system, many researchers have dedicated their work to classifying news content. Many techniques have been applied to news corpora to classify textual data into categories [1]. In addition, many researchers work with social media textual data to predict the movie preferences of viewers [2]. Social media data is also used to classify people's choices based on their behavior and values [3], [4]. Moreover, various machine learning and deep learning models, including Naïve Bayes [5], neural networks [6] and SVM [7], [8], have been implemented to classify news categories. In this study, we propose a novel method to classify news into multiple categories, namely Business, Sports, National and International, using news collected from Bangladeshi news portals. The main contribution of this study is a self-created dataset built by scraping different online news portals.

After that, to achieve our goal, we use word embedding with CNN [9]. The motivation of our proposed method is to develop a content-based news recommendation system that suggests relevant news to the user from a huge number of news articles. It can also support contextual advertising, which shows meaningful advertisements to the end user. For example, it is more meaningful to show advertisements of sports materials than of beauty products while someone is reading sports news.

The organization of this paper is as follows. Section II presents the literature review, Section III describes the data collection process, in which the data is collected from scratch by web scraping, Section IV describes the proposed methodology, and Section V presents our experimental results. Finally, Section VI concludes the paper.

2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on Computational Science and Information Management (ICSECS-ICOCSIM) | 978-1-6654-1407-4/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICSECS52883.2021.00117

II. LITERATURE REVIEW

A significant number of studies have been conducted on online news classification using different types of multiclass classifiers, and nowadays deep learning approaches are also becoming popular for text mining and news categorization. Research has been done on news categorization for English as well as for other languages. An intelligent web news classification system was proposed by Krishnalal G et al. [1], who used a Hidden Markov Model (HMM) and Support Vector Machine (SVM) to classify news into three categories: sports, finance and politics. They collected data from five popular Indian newspapers and compared KNN, SVM, and HMM-SVM.

Liliana et al. [7] proposed a machine learning model, namely Support Vector Machine (SVM), to categorize Indonesian news with an accuracy of 85%. Umid Suleymanov et al. [10] designed a text classification system based on Naïve Bayes, Support Vector Machine (SVM), and Artificial Neural Network for categorizing Azerbaijani news articles. They formed a new text corpus and used term frequency-inverse document frequency (TF-IDF) to convert the text to vectors. H. Duong et al. [11] summarized popular multi-class classifiers such as K-NN, Naïve Bayes, Logistic Regression, Decision Tree, Random Forest, SVM, One-vs-One (OVO), and One-vs-All (OVA), and applied them to a new benchmark dataset for Vietnamese news categorization with 25 classes. TF-IDF was used for feature extraction. They obtained the best result with the OVA classifier.

Besides news classification in other languages, some researchers have worked on Bangla document classification. Md. Saiful Islam et al. [12] presented a comparative study of three supervised machine learning algorithms, SVM, Naïve Bayes and SGD, for Bangla document categorization. They used the Chi-square distribution and TF-IDF for feature selection and explored these two techniques with each of the three algorithms. They found that SVM with TF-IDF feature selection gave the best result. Similarly, Shahi et al. [13] proposed a Nepali news classification system using SVM, Naïve Bayes, and Neural Networks. They also extracted features using TF-IDF.

Beakcheol et al. [14] presented a robust Word2vec CNN classification model to classify news articles and tweets. They implemented two word embedding methods, CBOW (Continuous Bag-of-Words) and Skip-gram, with a CNN. The experimental results showed that CBOW with CNN works better for classifying news articles, whereas Skip-gram with CNN works better for tweets. The authors in [15] classified Roman-Urdu news headlines with an accuracy of 93.5%. They observed that the SGD algorithm classifies the classes better than the other machine learning algorithms. Furthermore, other researchers have shown different techniques for news classification and text mining: M. P. Akhter et al. [16] used a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for document-level text classification, and Ali Ramdhani et al. [17] classified Indonesian news using CNN.

III. DATA COLLECTION

The data used in this experiment are collected from several popular Bangladeshi English daily newspapers, including the Daily Star¹ and the Daily Sun². A Python-based web scraper is built to gather the news titles and body contents, labeled with the category name. We use BeautifulSoup, a Python package, to scrape the news contents from the websites. Although newspaper websites use different web structures to represent news articles, we have designed a generic algorithm so that we can collect data from various newspaper sites. The pseudocode of the web scraper used in this study is shown in Algorithm 1.

Algorithm 1 The pseudocode of the proposed news scraper
Create the URL with date
Request the URL and save the page content
Create a soup using the BeautifulSoup library from the page content
From the soup, find all 'a' html tags
for i in all 'a' tags do
    Find the 'href' from i
    Request the page content using the 'href' and create a soup from it
    Find the category and append it to the all_categories list
    Find the title and append it to the all_titles list
    Find the body of the news and all 'p' tags in the body
    for p in all 'p' tags do
        Get the text from p and save it in a list
    end for
    Merge all the text into a single paragraph
    Append the paragraph to the news_contents list
end for
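The steps of Algorithm 1 can be sketched in Python roughly as follows. This is a minimal illustration, not the scraper used in the study: it swaps BeautifulSoup for the standard-library HTMLParser so that it runs without third-party packages, it assumes the title sits in an h1 tag and the body in p tags, and it leaves out the site-specific category lookup. The names LinkParser, ArticleParser, fetch and scrape are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on an index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class ArticleParser(HTMLParser):
    """Collects the <h1> title and the text of all <p> tags.

    Assumption: title in <h1>, body in <p>; real news sites differ
    and need site-specific rules.
    """

    def __init__(self):
        super().__init__()
        self._current = None
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h1":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(index_html, fetch_page=fetch):
    """Follow every link on the index page and collect titles and bodies."""
    links = LinkParser()
    links.feed(index_html)
    links.close()
    all_titles, news_contents = [], []
    for href in links.hrefs:
        article = ArticleParser()
        article.feed(fetch_page(href))
        article.close()
        all_titles.append(article.title.strip())
        # Merge the paragraph texts into a single string per article.
        texts = [p.strip() for p in article.paragraphs if p.strip()]
        news_contents.append(" ".join(texts))
    return all_titles, news_contents
```

Passing fetch_page as a parameter keeps the parsing logic testable offline, for example by supplying a dictionary lookup instead of a network call.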

All the news articles belong to four categories, and the number of instances in each category is given in Table I.

TABLE I. Number of News in Different Categories

Category        Number of news
National        397
International   267
Sports          188
Business        447

IV. METHODOLOGY

Our proposed method consists of three steps: pre-processing, text representation and classification using CNN, as shown in Fig. 1.

¹ https://thedailystar.net

² https://www.daily-sun.com

Fig. 1: Block Diagram of Proposed Method

A. Pre-processing

In the preprocessing step, all the text in the news is converted to lowercase. Then we remove punctuation marks, digits and extra white spaces using regular expressions. Additionally, stop words, the most common words in the English language such as “the”, “a”, “on”, “is” and “all”, which do not carry important meaning, are eliminated. We also perform lemmatization to reduce the inflectional forms of a word to a common base form. All these preprocessing tasks are performed using NLTK. A dictionary of terms is then constructed from all the news in the dataset. A special token [PAD] is added to the dictionary for padding. In the dictionary, index 0 is reserved for the [PAD] token and is not assigned to any word. For CNN, all the input texts need to be of fixed length. Therefore, every news article shorter than the longest article in the dataset is padded with the [PAD] token. Each news article is then converted to a sequence of integers, where each integer is the index of a token in the dictionary. The ground truth of each news article is represented in one-hot encoded form.
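The preprocessing pipeline described above can be sketched as follows. This is an illustrative approximation under stated assumptions: the study uses NLTK, whereas this sketch uses a tiny hard-coded stop-word set and skips lemmatization so that it stays self-contained. The function names and the STOP_WORDS set are hypothetical.

```python
import re

# Assumption: a tiny illustrative subset; the study relies on NLTK instead.
STOP_WORDS = {"the", "a", "on", "is", "all"}
PAD = "[PAD]"


def preprocess(text):
    """Lowercase, drop punctuation and digits, split, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keeps only ASCII letters
    tokens = text.split()                   # also collapses extra whitespace
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocab(token_lists):
    """Map each token to an integer index; index 0 is reserved for [PAD]."""
    vocab = {PAD: 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to indices and right-pad with the [PAD] index."""
    ids = [vocab[t] for t in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


def one_hot(label, classes):
    """One-hot encode a category label."""
    return [1 if c == label else 0 for c in classes]
```

For instance, preprocess("The match is on, 3 goals!") keeps only the tokens "match" and "goals", and a one-token article encoded with max_len 3 ends in two padding zeros.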

B. Text Representation

The dataset used in our study is small. Therefore, after the preprocessing steps, the news is represented using a pre-trained word embedding method. Word embedding can capture the context of a word in a document, its semantic and syntactic similarity, and its relation to other words in the news. Each word in the vocabulary is represented as a continuous vector of 300 dimensions obtained from pre-trained word vectors trained on Wikipedia data using the skip-gram model. The maximum length of a news article in our training corpus is 150 tokens. Hence, each news article in the corpus is represented as a word embedding matrix of size 150 × 300 and then fed into the network.
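As a rough illustration of how a padded index sequence becomes a 150 × 300 matrix, the sketch below builds a lookup table from a dictionary of pre-trained vectors. The names are hypothetical, the pretrained dictionary stands in for the real Wikipedia skip-gram vectors, and the handling of words missing from it (small random vectors) is an assumption, since the text does not say how such words are treated.

```python
import random

EMBED_DIM = 300  # vector size stated in the text
MAX_LEN = 150    # maximum news length stated in the text


def build_embedding_table(vocab, pretrained, dim=EMBED_DIM, seed=0):
    """Return one row per vocabulary index; row 0 ([PAD]) stays all zeros."""
    rng = random.Random(seed)
    table = [[0.0] * dim for _ in range(len(vocab))]
    for word, idx in vocab.items():
        if idx == 0:
            continue
        vec = pretrained.get(word)
        if vec is None:
            # Assumption: unseen words get small random vectors.
            vec = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        table[idx] = list(vec)
    return table


def embed(ids, table):
    """Look up each index, giving a len(ids) x dim matrix."""
    return [table[i] for i in ids]
```

With a sequence padded to MAX_LEN, embed returns 150 rows of 300 values each, which matches the input size described above.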

C. Classification of news using CNN

Recently, deep learning has gained much popularity for text classification. Popular deep learning classification techniques include CNN, gated recurrent unit (GRU) [18], LSTM [19], and random weighted LSTM (RWL) [20]. To the best of our knowledge, CNN has not been explored so far for news classification. Therefore, CNN is used in this study for text feature extraction and news classification. CNN is a feed-forward network model based on an artificial neural network consisting of an input layer, hidden layers, and an output layer. The hidden layers in a CNN comprise convolution layers and pooling layers that learn and extract image or text features. The output layer is a fully connected layer that performs the classification.

The CNN model used in this study consists of an initial embedding layer that maps the input news into a matrix, followed by a convolution layer of 32 filters, each with a kernel of size 3, and a ReLU activation layer. The convolution layer, or feature extractor layer, performs the convolution operation by calculating the dot product between the kernel and the receptive field of the input matrix. The process repeats until the whole matrix is traversed, and the output is passed to the ReLU activation layer.

The activation layer introduces non-linearity into the model. We use the ReLU activation function since it makes the training of the model easier and improves the generalization performance compared to other activation functions. After the activation layer, we apply a max-pooling layer to downsample the feature maps, followed by a flatten layer. Finally, to perform the classification, a dense layer of size 4, matching the number of news classes, with a softmax function is appended to the network. The softmax function determines the output probability distribution over the four news classes. The architecture of the network is visualized in Fig. 2.
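To make the layer sequence concrete, the following sketch implements the forward pass of convolution with ReLU, max pooling, flattening and a softmax output in plain Python. It is a toy illustration rather than the trained model: the absence of padding in the convolution and the pool size of 2 are assumptions, because the text does not state them, and all function names are hypothetical. Under these assumptions, a 150 × 300 input with 32 filters of kernel size 3 would give a 148 × 32 feature map, 74 × 32 after pooling, and 2368 values after flattening.

```python
import math


def conv1d_relu(x, kernels, biases):
    """x: L x D input; kernels: F x K x D. Returns (L - K + 1) x F after ReLU."""
    k = len(kernels[0])
    out = []
    for start in range(len(x) - k + 1):
        window = x[start:start + k]  # receptive field of K consecutive tokens
        row = []
        for kernel, bias in zip(kernels, biases):
            s = bias + sum(
                w * v
                for k_row, x_row in zip(kernel, window)
                for w, v in zip(k_row, x_row)
            )
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out


def max_pool1d(x, pool=2):
    """Take the per-filter maximum over non-overlapping windows of `pool` rows."""
    return [
        [max(col) for col in zip(*x[i:i + pool])]
        for i in range(0, len(x) - pool + 1, pool)
    ]


def flatten(x):
    """Concatenate all rows into one vector."""
    return [v for row in x for v in row]


def dense_softmax(v, weights, biases):
    """Fully connected layer followed by a numerically stable softmax."""
    logits = [b + sum(w * a for w, a in zip(ws, v)) for ws, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The dot product inside conv1d_relu is the operation described in the text: each kernel is multiplied element-wise with its receptive field and summed, and the window then slides one token forward.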

The network is trained for 20 epochs with a batch size of 16 using the Adam optimization method to minimize the categorical cross-entropy loss shown in Eq. 1.

$$L(\theta) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) \tag{1}$$

where $C$ is the number of target classes, $y$ is the one-hot representation of the ground truth and $\hat{y}$ is the estimated probability distribution assigned to the news classes by the model.
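A small numerical example may help in reading Eq. 1. The sketch below evaluates the categorical cross-entropy for one news article. The function name and the eps guard against log(0) are illustrative assumptions and not part of the paper.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. 1: negative sum over classes of y_i * log(y_hat_i)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# One-hot ground truth for the second of four classes, with a softmax
# output that assigns probability 0.7 to that class.
loss = categorical_cross_entropy([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1])
```

Because the ground truth is one-hot, only the term of the true class contributes, so the loss here reduces to -log(0.7), roughly 0.357.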

V. EXPERIMENTS AND RESULTS

In this study, a novel method is proposed for the classification of English news into four groups (Sports, National, International and Business) from the news title and news body using pre-trained word embeddings and a convolutional neural network. The proposed method is compared with six traditional machine learning algorithms as baselines: random forest (RF) [21], adaptive boosting (AdaBoost) [22], gradient boosting tree (GBT) [23], decision tree (DT) [24], support vector machine (SVM) [25] and k-nearest neighbour (kNN) [26].

For the baseline methods, we obtain the Bag-of-Words (BoW) representation of each news article and calculate the term frequency-inverse document frequency (TF-IDF) of each term in the BoW representation. Next, the most significant terms for classifying news into the four groups are selected using the Analysis of Variance (ANOVA) hypothesis testing method. The hypothesis test is performed on the TF-IDF values of each term (feature) across the 4 news categories. Based on the test statistics, we select the 647 terms whose p-values are less than the significance level of 0.01 and train the machine learning classifiers on them. For both the proposed and the baseline methods, we divide the dataset into train and test sets with a ratio of 8:2. We train the models on the training set and measure the performance on the test set.

Fig. 2: Architecture of the CNN model to classify news categories
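The ANOVA-based term selection can be illustrated with the one-way F statistic for a single term. In this sketch, each group holds the TF-IDF values of that term in the articles of one category. The function name is hypothetical, and the conversion of F into a p-value (which needs the F distribution) is left out. Library routines such as scipy.stats.f_oneway return both values.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers.

    F = (between-group variance) / (within-group variance).
    Raises ZeroDivisionError when every group is constant.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A term whose TF-IDF values differ strongly between categories yields a large F and therefore a small p-value, which is why such terms pass the 0.01 threshold mentioned above.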

Table II shows the results of the experiments. We use macro and micro f1-scores as evaluation measures in this paper. The macro f1-score is the arithmetic mean of the per-class f1-scores, while the micro f1-score is computed by combining micro-precision and micro-recall over all the samples. According to the results, our proposed method performs better than all the baseline methods in terms of both macro f1-score and micro f1-score, reaching 0.90 and 0.89 respectively against at most 0.86 for the baselines. The main reason behind the better performance of the proposed method is that we use pre-trained word embeddings for text representation, which can represent the semantic and syntactic relationships among words. On the other hand, the baseline methods use the BoW representation, which generates sparse vectors in the case of limited data. Therefore, the classifiers cannot properly learn the non-linear relationship between the news and its category. Consequently, the baseline methods show lower performance than the proposed method.
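The difference between the two averages can be seen in a short sketch. The function below counts true positives, false positives and false negatives per class, then averages per-class f1-scores for the macro value and pools the counts for the micro value. The function name and the toy labels are illustrative only.

```python
def f1_scores(y_true, y_pred, classes):
    """Return (per-class f1 dict, macro f1, micro f1) for single-label data."""
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    macro = sum(per_class.values()) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro


per_class, macro, micro = f1_scores(
    ["S", "S", "B", "B", "N"],
    ["S", "B", "B", "B", "N"],
    ["S", "B", "N"],
)
```

In single-label classification every misclassified sample adds exactly one false positive and one false negative, so the micro f1-score equals the overall accuracy (4 of 5 correct, or 0.8, in this toy example), while the macro f1-score weights each class equally regardless of its size.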

TABLE II. News Classification Result Measured by Precision, Recall and F-score

Method          Macro Pre.   Micro Pre.   Macro Rec.   Micro Rec.   Macro F-score   Micro F-score
Proposed        0.91         0.89         0.89         0.89         0.90            0.89
RF+BoW          0.88         0.86         0.85         0.85         0.86            0.85
AdaBoost+BoW    0.69         0.65         0.64         0.62         0.65            0.63
GBT+BoW         0.75         0.73         0.68         0.69         0.70            0.70
DT+BoW          0.68         0.67         0.65         0.66         0.66            0.66
SVM+BoW         0.89         0.87         0.84         0.85         0.86            0.86
kNN+BoW         0.87         0.85         0.86         0.85         0.86            0.85

The class-wise precision, recall and f1-score of the proposed method are shown in Table III.

TABLE III. Precision, recall and f1-score for each class obtained using proposed method

Class           Precision   Recall   F-score
Sports          1.00        0.85     0.92
National        0.81        0.89     0.85
International   0.93        0.95     0.94
Business        0.90        0.87     0.89

As we can see from Table III, the proposed method achieves its highest precision (1.00) on the Sports category and its highest f-score (0.94) on the International category, and exhibits the lowest performance in identifying the National category of news (f-score of 0.85).
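As a consistency check on Table III, the f-score of a class can be recomputed from its precision and recall, and the macro f-score from the four per-class values. The rounding to two decimals is an assumption about how the tables were produced.

```latex
\begin{align*}
F_1 &= \frac{2 \cdot P \cdot R}{P + R} \\
F_1^{\text{Sports}} &= \frac{2 \cdot 1.00 \cdot 0.85}{1.00 + 0.85} \approx 0.92 \\
F_1^{\text{National}} &= \frac{2 \cdot 0.81 \cdot 0.89}{0.81 + 0.89} \approx 0.85 \\
\text{macro-}F_1 &= \frac{0.92 + 0.85 + 0.94 + 0.89}{4} = 0.90
\end{align*}
```

The result agrees with the macro f-score of 0.90 reported for the proposed method in Table II.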

For a better visual representation, we also show the class-wise ROC curves of the proposed method in Fig. 3. From the figure, we can see that the area under the curve (AUC) is high for all classes, which demonstrates the effectiveness of our proposed method in news classification.
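For readers who want to reproduce a class-wise AUC, a one-vs-rest sketch is given below. It uses the fact that the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are illustrative, and the scores are assumed to be the softmax probabilities of the class under study.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC of one class against all others via pairwise comparison."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Scores are the model's probabilities for the "Sports" class.
auc = auc_one_vs_rest([0.9, 0.4, 0.6, 0.3], ["S", "S", "B", "N"], "S")
```

In this toy case three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75. The pairwise form is quadratic in the number of samples, which is acceptable for a test set of a few hundred articles.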

Fig. 3: Class-wise receiver operating characteristics curve of proposed method

VI. CONCLUSION

In this study, a method is proposed for news classification from the news title and body using word embedding and a CNN model. To measure the performance of the method, a dataset consisting of four different categories of news is prepared. The results show that the proposed method outperforms the baseline methods by a margin of at least 0.03 in both macro and micro f1-scores. The proposed method can be used in online news portals for contextual advertisements. In the future, we plan to increase the size of our dataset as well as add more news categories. Moreover, extending the proposed method to Bangla news can be investigated.

REFERENCES

[1] G. Krishnalal, S. B. Rengarajan, and K. Srinivasagan, “A new text mining approach based on hmm-svm for web news classification,” International Journal of Computer Applications, vol. 1, no. 19, pp. 98–104, 2010.
[2] E. M. Khan, M. S. H. Mukta, M. E. Ali, and J. Mahmud, “Predicting users’ movie preference and rating behavior from personality and values,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 10, no. 3, pp. 1–25, 2020.
[3] M. M. Rahman, M. T. H. Majumder, M. S. H. Mukta, M. E. Ali, and J. Mahmud, “Can we predict eat-out preference of a person from tweets?,” in Proceedings of the 8th ACM Conference on Web Science, pp. 350–351, 2016.
[4] M. S. H. Mukta, A. S. Sakib, M. A. Islam, M. E. Ali, M. Ahmed, and M. A. Rifat, “Friends’ influence driven users’ value change prediction from social media usage,” SBP-BRiMS, 2021.
[5] G. Septian, A. Susanto, and G. F. Shidik, “Indonesian news classification based on nabana,” in 2017 International Seminar on Application for Technology of Information and Communication (iSemantic), pp. 175–180, IEEE, 2017.
[6] S. Kaur and N. K. Khiva, “Online news classification using deep learning technique,” International Research Journal of Engineering and Technology (IRJET), vol. 3, no. 10, pp. 558–563, 2016.
[7] D. Y. Liliana, A. Hardianto, and M. Ridok, “Indonesian news classification using support vector machine,” World Academy of Science, Engineering and Technology, vol. 57, pp. 767–770, 2011.
[8] I. Dilrukshi, K. De Zoysa, and A. Caldera, “Twitter news classification using svm,” in 2013 8th International Conference on Computer Science & Education, pp. 287–291, IEEE, 2013.
[9] P. Kim, “Convolutional neural network,” in MATLAB deep learning, pp. 121–147, Springer, 2017.
[10] U. Suleymanov, S. Rustamov, M. Zulfugarov, O. Orujov, N. Musayev, and A. Alizade, “Empirical study of online news classification using machine learning approaches,” in 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT), pp. 1–6, IEEE, 2018.
[11] H.-T. Duong and V. T. Hoang, “A survey on the multiple classifier for new benchmark dataset of vietnamese news classification,” in 2019 11th International Conference on Knowledge and Smart Technology (KST), pp. 23–28, IEEE, 2019.
[12] M. Islam, F. E. M. Jubayer, S. I. Ahmed, et al., “A comparative study on different types of approaches to bengali document categorization,” arXiv preprint arXiv:1701.08694, 2017.
[13] T. B. Shahi and A. K. Pant, “Nepali news classification using naïve bayes, support vector machines and neural networks,” in 2018 International Conference on Communication Information and Computing Technology (ICCICT), pp. 1–5, IEEE, 2018.
[14] B. Jang, I. Kim, and J. W. Kim, “Word2vec convolutional neural networks for classification of news articles and tweets,” PloS one, vol. 14, no. 8, p. e0220976, 2019.
[15] S. M. Hassan, F. Ali, S. Wasi, S. Javeed, I. Hussain, and S. N. Ashraf, “Roman-urdu news headline classification with ir models using machine learning algorithms,” Indian Journal of Science and Technology, vol. 12, no. 35, pp. 1–9, 2019.
[16] M. P. Akhter, Z. Jiangbin, I. R. Naqvi, M. Abdelmajeed, A. Mehmood, and M. T. Sadiq, “Document-level text classification using single-layer multisize filters convolutional neural network,” IEEE Access, vol. 8, pp. 42689–42707, 2020.
[17] M. A. Ramdhani, D. S. Maylawati, and T. Mantoro, “Indonesian news classification using convolutional neural network,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 19, no. 2, pp. 1000–1009, 2020.
[18] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[20] A. S. Al Rafi, T. Rahman, A. R. Al Abir, T. A. Rajib, M. Islam, and M. S. H. Mukta, “A new classification technique: random weighted lstm (rwl),” in 2020 IEEE Region 10 Symposium (TENSYMP), pp. 262–265, IEEE, 2020.
[21] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[22] Y. Freund, R. Schapire, and N. Abe, “A short introduction to boosting,” Journal-Japanese Society For Artificial Intelligence, vol. 14, no. 771-780, p. 1612, 1999.
[23] J. H. Friedman, “Stochastic gradient boosting,” Computational statistics & data analysis, vol. 38, no. 4, pp. 367–378, 2002.
[24] Y.-Y. Song and L. Ying, “Decision tree methods: applications for classification and prediction,” Shanghai archives of psychiatry, vol. 27, no. 2, p. 130, 2015.
[25] C. Cortes, “Support-vector network,” Machine Learning, vol. 20, pp. 1–25, 1995.
[26] L. E. Peterson, “K-nearest neighbor,” Scholarpedia, vol. 4, no. 2, p. 1883, 2009.

613

Unifying Sentence Transformer Embedding and Softmax Voting Ensemble for Accurate News Category Prediction

Article

Full-text available

Jul 2023

The study focuses on news category prediction and investigates the performance of sentence embedding of four transformer models (BERT, RoBERTa, MPNet, and T5) and their variants as feature vectors when combined with Softmax and Random Forest using two accessible news datasets from Kaggle. The data are stratified into train and test sets to ensure equal representation of each category. Word embeddings are generated using transformer models, with the last hidden layer selected as the embedding. Mean pooling calculates a single vector representation called sentence embedding, capturing the overall meaning of the news article. The performance of Softmax and Random Forest, as well as the soft voting of both, is evaluated using evaluation measures such as accuracy, F1 score, precision, and recall. The study also contributes by evaluating the performance of Softmax and Random Forest individually. The macro-average F1 score is calculated to compare the performance of different transformer embeddings in the same experimental settings. The experiments reveal that MPNet versions v1 and v3 achieve the highest F1 score of 97.7% when combined with Random Forest, while T5 Large embedding achieves the highest F1 score of 98.2% when used with Softmax regression. MPNet v1 performs exceptionally well when used in the voting classifier, obtaining an impressive F1 score of 98.6%. In conclusion, the experiments validate the superiority of certain transformer models, such as MPNet v1, MPNet v3, and DistilRoBERTa, when used to calculate sentence embeddings within the Random Forest framework. The results also highlight the promising performance of T5 Large and RoBERTa Large in voting of Softmax regression and Random Forest. The voting classifier, employing transformer embeddings and ensemble learning techniques, consistently outperforms other baselines and individual algorithms. 
These findings emphasize the effectiveness of the voting classifier with transformer embeddings in achieving accurate and reliable predictions for news category classification tasks.

Machine Learning-Based Tomato Leaf Disease Diagnosis Using Radiomics Features

Chapter

Full-text available

May 2023

Tomato leaves can be infected with various infectious viruses and fungal diseases that drastically reduce tomato production and incur a great economic loss. Therefore, tomato leaf disease detection and identification are crucial for maintaining the global demand for tomatoes for a large population. This paper proposes a machine learning-based technique to identify diseases on tomato leaves and classify them into three diseases (Septoria, Yellow Curl Leaf, and Late Blight) and one healthy class. The proposed method extracts radiomics-based features from tomato leaf images and identifies the disease with a gradient boosting classifier. The dataset used in this study consists of 4000 tomato leaf disease images collected from the Plant Village dataset. The experimental results demonstrate the effectiveness and applicability of our proposed method for tomato leaf disease detection and classification.KeywordsTomato leaf diseaseMachine learningRadiomics featuresClassification

SkinNet-8: An Efficient CNN Architecture for Classifying Skin Cancer on an Imbalanced Dataset

Conference Paper

Apr 2023

Skin cancer is a fatal disease that has become the leading cause of death worldwide in recent years, although it is curable if diagnosed early. Early skin cancer detection significantly improves patients' chances of survival and reduces mortality. In this research, we conduct experiments on a high imbalance dermoscopic ISIC 2020 dataset. The primary objective of this study is to develop a shallow CNN architecture to complete the classification task effectively, requiring fewer computational resources without compromising accuracy. We have used pre-processing techniques to remove image noise and truncation and augmentation techniques to balance the dataset before fitting it into the model. Multiple performance measurement metrics were utilized to establish the overall performance. Our proposed model yields a remarkable test accuracy of 98.81%. We compare our models' performance with different transfer learning (TL) models to assess the faster convergence rate. The proposed model demonstrated its robustness by outperforming the other TL models in terms of accuracy within a short processing time. It is reasonable to assume that our proposed system will reliably aid dermatologists in diagnosing skin cancer patients early and increasing survival rates.

SkinNet-8: An Efficient CNN Architecture for Classifying Skin Cancer on an Imbalanced Dataset

Conference Paper

Full-text available

Apr 2023

A Dynamic Weighted Tabular Method for Convolutional Neural Networks

Article

Full-text available

Jan 2022

Traditional Machine Learning (ML) models are generally preferred for classification tasks on tabular datasets, which often produce unsatisfactory results in complex tabular datasets. Recent works, using Convolutional Neural Networks (CNN) with embedding techniques, outperform the traditional classifiers on tabular dataset. However, these embedding techniques fail to use an automated approach after analyzing the importance of the features in the dataset accurately. This study introduces a novel feature embedding technique named Dynamic Weighted Tabular Method (DWTM), which dynamically uses feature weights based on their strength of the correlations to the class labels during applying any CNN architectures on the tabular datasets. DWTM converts each data point into images and then feeds to a CNN architecture. It dynamically embeds the features of the tabular dataset based on their strength and assigns pixel positions to the appropriate features in the image canvas space instead of using any static configuration. In this paper, DWTM embedding method is applied over six benchmark tabular datasets independently by using three different CNN architectures (i.e., ResNet-18, DenseNet and InceptionV1) and an outstanding performance (an average accuracy of 98%) has obtained, which outperforms any traditional and CNN based classifiers as well.

Friends' Influence Driven Users' Value Change Prediction from Social Media Usage

Conference Paper

Full-text available

Jul 2021

Basic human values represent a set of values such as security, independence, success, kindness, and pleasure, which we deem important to our lives. The value priority of a person may change over time due to different factors such as life experiences, influence, social structure and technology. In this study, we show that we can predict the value change of a person by considering both the influence of her friends and her social media usage. This is the first work in the literature that relates the influence of social media friends on the human value dynamics of a user. We propose a Bounded Confidence Model (BCM) based value dynamics model from 275 different ego networks in Facebook that predicts how social influence may persuade a person to change her value over time. Then, to predict better, we use a particle swarm optimization based hyperparameter tuning technique. We observe that these optimized hyperparameters produce more accurate future value score. We also run our approach with different machine learning based methods and find support vector regressor (SVR) outperforms other regressor models. By using SVR with the best hyperparameters of BCM model, we find the lowest Mean Squared Error (MSE) score as 0.00347.

A New Classification Technique: Random Weighted LSTM (RWL)

Article

Jun 2020

Due to the unprecedented growth of digital devices and handheld smartphones, users generate several quintillion bytes of data every day by using social media, blogs, YouTube, etc. With the advancement of machine learning techniques, we classify these data automatically to support critical decision making. The majority of these classification algorithms are linear in nature and perform poorly in predicting class labels when the attributes are complex. For example, human behavior, preference, personality, etc. have numerous non-linear properties and are difficult to predict in real life using these traditional machine learning algorithms. In this paper, we propose a novel non-linear technique based on the Long Short-Term Memory (LSTM) architecture. Studies show that Recurrent Neural Network (RNN) and LSTM based models usually predict time-series and sequential data better than other models. We significantly change the operational mechanism of LSTM and achieve outstanding performance on classification problems. We run our algorithm over six different datasets: Iris, Pima Indian, Breast Cancer, Blood Transfusion, StackOverflow, and Banknote Authentication. We compare the performance of our algorithm with other traditional classifiers. Our classifier generally outperforms conventional linear and non-linear classifiers.

Predicting Users' Movie Preference and Rating Behavior from Personality and Values

Article

Oct 2020

In this article, we propose novel techniques to predict a user's movie genre preference and rating behavior from her psycholinguistic attributes obtained from her social media interactions. The motivation for this work comes from various psychological studies demonstrating that psychological attributes such as personality and values can influence one's decisions or choices in real life. In this work, we integrate user interactions in Twitter and IMDb to derive interesting relations between human psychological attributes and movie preferences. In particular, we first predict a user's movie genre preferences from the personality and value scores of the user derived from her tweets. Second, we develop models to predict a user's movie rating behavior from her tweets in Twitter and her movie genre and storyline preferences from IMDb. We further strengthen the movie rating model by incorporating user reviews. In the above models, we investigate the role of personality and values both independently and in combination while predicting movie genre preferences and movie rating behaviors. We find that our combined models significantly improve accuracy over a single model built using personality or values alone. We also compare our technique with traditional movie genre and rating prediction techniques. The experimental results show that our models are effective in recommending movies to users.

Indonesian news classification using convolutional neural network

Article

Aug 2020

Every language has unique characteristics, structures, and grammar. Thus, different language styles require different processing and yield different results in the Natural Language Processing (NLP) research area. In current NLP research, Data Mining (DM) and Machine Learning (ML) techniques are popular, especially Deep Learning (DL) methods. This research aims to classify text data in the Indonesian language using a Convolutional Neural Network (CNN) as one of the DL algorithms. The CNN algorithm used is modified to follow the characteristics of the Indonesian language. Accordingly, in the text pre-processing phase, stopword removal and stemming are adapted to the Indonesian language. The experiment was conducted using 472 Indonesian news texts from various sources in four categories: ‘hiburan’ (entertainment), ‘olahraga’ (sport), ‘tajuk utama’ (headline news), and ‘teknologi’ (technology). Based on the experiment and evaluation using 377 training samples and 95 testing samples, producing five models with ten epochs each, the CNN achieves its best accuracy of around 90.74% and a loss value of around 29.05% for 300 hidden layers in classifying the Indonesian news data.

Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network

Article

Feb 2020

The rapid growth of electronic documents is causing problems such as unstructured data that require more time and effort to search for a relevant document. Text Document Classification (TDC) has great significance in information processing and retrieval, where unstructured documents are organized into pre-defined classes. Urdu is the most favored research language among South Asian languages because of its complex morphology, unique features, and lack of linguistic resources such as standard datasets. Compared to short text, as in sentiment analysis, long text classification needs more time and effort because of its large vocabulary, greater noise, and redundant information. Machine Learning (ML) and Deep Learning (DL) models have been widely used in text processing. Despite the major limitations of ML models, such as learning directed features, these are the favored methods for Urdu TDC. To the best of our knowledge, this is the first study of Urdu TDC using a DL model. In this paper, we design a large multi-purpose and multi-format dataset that contains more than ten thousand documents organized into six classes. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for classification and compare its performance with sixteen ML baseline models on three imbalanced datasets of various sizes. Further, we analyze the effects of preprocessing methods on SMFCNN performance. SMFCNN outperformed the baseline classifiers and achieved accuracy scores of 95.4%, 91.8%, and 93.3% on the medium, large and small datasets respectively. The designed dataset will be publicly and freely available in different formats for future research in Urdu text processing.

Roman-Urdu News Headline Classification with IR Models using Machine Learning Algorithms

Article

Oct 2019

Objectives: Roman-Urdu is considered a non-standard language that is used frequently on the Internet. Classifying Roman-Urdu text for article tagging is a difficult task because of the many irregularities in spelling; for example, the word khubsurat (beautiful) has multiple spellings in Roman-Urdu. It can also be written as khoobsurat, khubsoorat, and khobsorat. Methods/Statistical Analysis: In this study, we scrape Roman-Urdu news headlines from various online newspapers. Our corpus contains 12319 news headlines in seven categories, i.e. Accident, Sports, Weather, Arrest, Conference, Operation, and Violence. We also use different preprocessing approaches such as Roman-Urdu stop word removal, and apply IR models, i.e. TF-IDF and Count Vector, for feature extraction before applying classifier algorithms. Findings: We compare results between different Machine Learning algorithms such as RF, LSVC, MNB, LR, RC, PAC, Perceptron, NC, and SGDC. Our model gives the best result in identifying the desired class with the SGD classifier, which achieves 93.50% accuracy. Application/Improvements: It is recommended that SGD classifiers be used in Roman-Urdu news headline text classification. Keywords: Linear SVC, Multinomial Naïve Bayes (MNB), Ridge Classifier (RC), Random Forest, Roman-Urdu, Supervised Machine Learning, Stochastic Gradient Descent (SGD), Text Classification, Tf-Id
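
The two feature extraction schemes named in this abstract, Count Vector and TF-IDF, can be sketched in a few lines of plain Python. This is an illustrative sketch only, assuming whitespace tokenization and the textbook weighting tf × log(N / df); library implementations often use smoothed variants, and nothing here reproduces the authors' pipeline. The two toy headlines are hypothetical and reuse the spelling variants quoted above.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Count Vector features: raw term counts per document."""
    return [Counter(doc.lower().split()) for doc in docs]


def tf_idf(docs):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts a term once
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {t: (n / total) * math.log(n_docs / doc_freq[t]) for t, n in c.items()}
        )
    return weights


# Hypothetical headlines using two spellings of the same Roman-Urdu word.
headlines = ["khubsurat match", "khoobsurat match"]
weights = tf_idf(headlines)
```

Because the two spellings are treated as distinct tokens, each receives a non-zero weight in only one headline, while the shared word receives a weight of zero. This illustrates why spelling irregularity makes the task harder for bag-of-words features.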

Word2vec convolutional neural networks for classification of news articles and tweets

Article

Aug 2019
PLOS ONE

Big web data from sources including online news and Twitter are good resources for investigating deep learning. However, collected news articles and tweets almost certainly contain data unnecessary for learning, and this disturbs accurate learning. This paper explores the performance of word2vec Convolutional Neural Networks (CNNs) in classifying news articles and tweets into related and unrelated ones. Using the two word embedding algorithms of word2vec, Continuous Bag-of-Words (CBOW) and Skip-gram, we constructed a CNN with the CBOW model and a CNN with the Skip-gram model. We measured the classification accuracy of CNN with CBOW, CNN with Skip-gram, and CNN without word2vec models for real news articles and tweets. The experimental results indicated that word2vec significantly improved the accuracy of the classification model. The accuracy of the CBOW model was higher and more stable than that of the Skip-gram model. The CBOW model exhibited better performance on news articles, and the Skip-gram model exhibited better performance on tweets. Specifically, CNN with word2vec models was more effective on news articles than on tweets because news articles are typically more uniform than tweets.
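
The difference between the two word2vec algorithms named in this abstract lies in how training examples are formed: CBOW predicts a word from its surrounding context, whereas Skip-gram predicts each context word from the center word. The sketch below only builds those training examples for a toy sentence; the function names, the window size, and the tokens are assumptions for illustration, and no embedding is actually trained.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=1):
    """CBOW: (context words, target word) pairs."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=1):
    """Skip-gram: (center word, one context word) pairs."""
    return [
        (tokens[i], c)
        for i in range(len(tokens))
        for c in context_window(tokens, i, window)
    ]


# Hypothetical three-word sentence.
tokens = ["news", "about", "sports"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
```

CBOW yields one example per word position, while Skip-gram yields one example per (center, context) pair, so Skip-gram produces more training examples from the same text.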

A Survey on the Multiple Classifier for New Benchmark Dataset of Vietnamese News Classification

Conference Paper

Jan 2019

In this paper, we summarize the well-known multi-class classifiers in the literature and evaluate them on a new benchmark dataset of Vietnamese news (VNNews-01). This dataset is created from more than thirty Vietnamese online newspaper websites and grouped into twenty-five categories. The dataset and evaluation presented in this work might promote related research on Vietnamese text mining.

Nepali news classification using Naïve Bayes, Support Vector Machines and Neural Networks

Conference Paper

Feb 2018

Automated news classification is the task of categorizing news into predefined categories based on content, with the confidence learned from a training news dataset. This research evaluates some of the most widely used machine learning techniques, mainly Naive Bayes, SVM and Neural Networks, for the automatic Nepali news classification problem. To evaluate the system, a self-created Nepali News Corpus with 20 different categories and a total of 4964 documents, collected by crawling different online national news portals, is used. TF-IDF based features are extracted from the preprocessed documents to train and test the models. The average empirical results show that the SVM with RBF kernel outperforms the other three algorithms with a classification accuracy of 74.65%, followed by the linear SVM with an accuracy of 74.62%, Multilayer Perceptron Neural Networks with an accuracy of 72.99%, and Naive Bayes with an accuracy of 68.31%.

Empirical Study of Online News Classification Using Machine Learning Approaches

Conference Paper