Word Embedding based News Classification by using CNN

Faisal Ahmed
Department of CSE
Premier University
Chattogram, Bangladesh
faisalcsecubd@gmail.com
Nazma Akther
Department of CSE
United International University (UIU)
Dhaka, Bangladesh
nazmacse2013@gmail.com
Mohammad Hasan
Department of CSE
Premier University
Chattogram, Bangladesh
mehedih256@gmail.com
Kibtia Chowdhury
Department of CSE
United International University (UIU)
Dhaka, Bangladesh
kchowdhury211056@mscse.uiu.ac.bd
Md. Saddam Hossain Mukta
Department of CSE
United International University (UIU)
Dhaka, Bangladesh
saddam@cse.uiu.ac.bd
Abstract—In this era of information technology, the number of
online news portals is increasing day by day. These portals earn
a good profit by advertising consumer products to their readers.
However, traditional news portals lack the intelligence to identify
which types of news their users prefer. As a consequence, they
frequently show irrelevant advertisements to readers, which causes
a considerable economic loss to advertisers. If a news portal can
identify the type of news a user is reading, it can serve contextual
advertisements (advertisements of news-related products) and gain
more profit. Therefore, in this paper, we propose a method that
integrates word embedding with a Convolutional Neural Network (CNN)
to classify English news into four categories: Sports, Business,
National and International. The performance of the proposed method
is evaluated on our prepared dataset in terms of macro-f1 and
micro-f1 scores. The experimental results show that the proposed
method achieves macro-f1 and micro-f1 scores of 0.90 and 0.89,
respectively, which are higher than those of all the baseline
methods.
Keywords—News Classification, Word Embedding, CNN, BoW,
Contextual Marketing, Machine learning
I. INTRODUCTION
With the advancement of information technology, a massive amount
of information is now stored in electronic form. Since vast amounts
of news are available on the Internet, finding the news of interest
has become a time-consuming process. In the financial sector, news
events are a critical factor that moves the financial market
positively or negatively. Therefore, classification of news is a
crucial step in allowing users to reach their news of interest
quickly and effectively. Besides, news articles have become an
important consideration for company managers, policy makers and
investors in making better decisions. People usually read news to
acquire information about their preferred areas from all over the
world. However, news classification is a challenging task in the
field of text mining because of the sheer volume of news available
in digital media. It is very difficult for editors to derive
structured information from unstructured news data.
To develop such an automated news classification system, many
researchers have dedicated their work to classifying news content.
Many techniques have been applied to news corpora to classify
textual data into categories [1]. In addition, many researchers
work with social media textual data to predict the movie
preferences of viewers [2]. Social media data is also used to
classify people's choices based on their behavior and values [3],
[4]. Moreover, various machine learning and deep learning models,
including Naïve Bayes [5], neural networks [6] and SVM [7], [8],
have been implemented to classify news categories. In this study,
we propose a novel method to classify news into multiple
categories, namely Business, Sports, National and International,
using news collected from Bangladeshi news portals. The main
contribution of this study is a self-created dataset, built by
scraping different online news portals.
To achieve our goal, we use word embedding with CNN [9]. The
motivation of our proposed method is to develop a content-based
news recommendation system that helps the user select relevant
news from a huge number of news articles. In addition, it can
support contextual advertising, which shows meaningful
advertisements to the end user. For example, it is more meaningful
to show advertisements of sports materials than beauty products
while someone is reading sports news.
The organization of this paper is as follows. Section II presents
the literature review, Section III describes the data collection
process, in which data is collected from scratch by web scraping,
Section IV describes the proposed methodology, and Section V
presents our experimental results. Finally, Section VI concludes
the paper.
2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on
Computational Science and Information Management (ICSECS-ICOCSIM)
978-1-6654-1407-4/21/$31.00 ©2021 IEEE
DOI 10.1109/ICSECS52883.2021.00117
II. LITERATURE REVIEW
A significant number of studies have been conducted on online news
classification using different types of multiclass classifiers,
and deep learning approaches are nowadays also becoming popular for
text mining and news categorization. Research has been carried out
on news categorization for English as well as for other languages.
Krishnalal G et al. [1] proposed an intelligent web news
classification system that uses a Hidden Markov Model (HMM) and a
Support Vector Machine (SVM) to classify news into three
categories: sports, finance and politics. They collected data from
five popular Indian newspapers and compared KNN, SVM and HMM-SVM.
Liliana et al. [7] proposed an SVM model to categorize Indonesian
news with an accuracy of 85%. Umid Suleymanov et al. [10] designed
a text classification system based on Naïve Bayes, SVM and
Artificial Neural Networks for categorizing Azerbaijani news
articles. They built a new text corpus and used term
frequency-inverse document frequency (TF-IDF) to convert text to
vectors. H. Duong et al. [11] surveyed popular multi-class
classifiers such as K-NN, Naïve Bayes, Logistic Regression,
Decision Tree, Random Forest, SVM, One-vs-One (OVO) and One-vs-All
(OVA), and applied them to a new benchmark dataset for Vietnamese
news categorization with 25 classes, using TF-IDF for feature
extraction. They obtained the best result with the OVA classifier.
Besides news classification in other languages, some researchers
have worked on Bangla document classification. Md. Saiful Islam et
al. [12] presented a comparative study of three supervised machine
learning algorithms, namely SVM, Naïve Bayes and SGD, for Bangla
document categorization. They used the Chi-square distribution and
TF-IDF for feature selection, explored both techniques with the
three algorithms, and found that SVM with TF-IDF gave the best
result. Similarly, Shahi et al. [13] proposed a Nepali news
classification system using SVM, Naïve Bayes and Neural Networks,
with features extracted using TF-IDF.
Jang et al. [14] presented a Word2vec CNN classification model to
classify news articles and tweets. They implemented two word
embedding methods, CBOW (Continuous Bag-of-Words) and Skip-gram,
with a CNN. Their experimental results showed that CBOW with CNN
works better for news articles, whereas Skip-gram with CNN works
better for tweets. The authors in [15] classified Roman-Urdu news
headlines with an accuracy of 93.5% and observed that the SGD
algorithm classifies them better than other machine learning
algorithms. Furthermore, other researchers have applied CNN-based
techniques to news classification and text mining: M. P. Akhter et
al. [16] used a Single-layer Multisize Filters Convolutional Neural
Network (SMFCNN) for document-level text classification, and Ali
Ramdhani et al. [17] classified Indonesian news using CNN.
III. DATA COLLECTION
The data used in this experiment are collected from several
popular Bangladeshi English-language daily newspapers, including
the Daily Star¹ and the Daily Sun². A Python-based web scraper is
built to gather the news titles and body contents, labeled with
the category name. We use BeautifulSoup, a Python package for
parsing HTML, to scrape the news contents from the websites.
Although newspaper websites use different page structures to
present news articles, we have designed a generic algorithm so
that we can collect data from various newspaper sites. The
pseudocode of the web scraper used in this study is shown in
Algorithm 1.
Algorithm 1 The pseudocode of the proposed news scraper
Create the URL with date
Request the URL and save the page content
Create a soup using the BeautifulSoup library from the page content
From the soup, find all ‘a’ html tags
for i in all ‘a’ tags do
    Find the ‘href’ from i
    Request and find the page content using the ‘href’
    Find the category and append in all_categories list
    Find the title and append in all_titles list
    Create the soup and find the body of the news
    Find all ‘p’ tags from the body
    for p in all ‘p’ tags do
        Get the text from p and save in a list
    end for
    Merge all the text in the list into a single paragraph
    Append the paragraph in news_contents list
end for
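The parsing steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' scraper: it swaps BeautifulSoup for the standard-library `html.parser` so that it runs without third-party packages, it works on HTML strings instead of requesting live pages, and the names `LinkAndParagraphParser` and `parse_page` as well as the sample HTML are assumptions made for the example. Category and title extraction are omitted because they depend on each site's page structure.

```python
from html.parser import HTMLParser


class LinkAndParagraphParser(HTMLParser):
    """Collects the href of every <a> tag and the text of every <p> tag."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values found on the page
        self.paragraphs = []   # cleaned text of each <p> element
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the pieces and collapse repeated white space.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def parse_page(html):
    """Return (list of links, all <p> text merged into one paragraph)."""
    parser = LinkAndParagraphParser()
    parser.feed(html)
    return parser.links, " ".join(parser.paragraphs)


# Hypothetical index page and article page, used only for illustration.
index_html = '<a href="/sports/1">Match report</a><a href="/business/2">Markets</a>'
article_html = "<h1>Title</h1><div><p>First  paragraph.</p><p>Second <b>one</b>.</p></div>"

links, _ = parse_page(index_html)
_, body = parse_page(article_html)
print(links)
print(body)
```

In a real scraper, each collected link would be requested in turn and its body appended to the list of news contents, as in the loop of Algorithm 1.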
All the news articles belong to four categories, and the number of
instances in each category is given in Table I.
TABLE I. Number of News in Different Categories
Category        Number of news
National        397
International   267
Sports          188
Business        447
IV. METHODOLOGY
Our proposed method consists of three steps: pre-processing, text
representation, and classification using CNN, as shown in Fig. 1.
¹ https://thedailystar.net
² https://www.daily-sun.com
Fig. 1: Block Diagram of Proposed Method
A. Pre-processing
In the pre-processing step, all the text in the news is converted
to lowercase. We then remove punctuation marks, digits and extra
white spaces using regular expressions. Additionally, stop words,
i.e., the most common words in the English language such as “the”,
“a”, “on”, “is” and “all”, which do not carry important meaning,
are eliminated. We also perform lemmatization to reduce the
inflectional forms of a word to a common base form. All these
pre-processing tasks are performed using NLTK. A dictionary of
terms is then constructed from all the news in the dataset. A
special token [PAD] is added to the dictionary for padding; index 0
is reserved for the [PAD] token and is not assigned to any word.
For the CNN, all input texts need to be of a fixed length.
Therefore, every news article that is shorter than the longest
article in the dataset is padded with the [PAD] token. Each news
article is then converted to a sequence of integers, where each
integer is the index of a token in the dictionary. The ground truth
of each news article is represented in one-hot encoded form.
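The pre-processing pipeline described above can be illustrated with a short sketch. It is a simplified stand-in rather than the authors' code: the stop-word set is a tiny hand-written subset instead of the NLTK list, lemmatization is omitted, and the helper names (`clean`, `build_vocab`, `encode`, `one_hot`) and the two sample sentences are assumptions made for the example.

```python
import re

# Tiny illustrative stop-word subset; the paper uses the NLTK list.
STOP_WORDS = {"the", "a", "on", "is", "all"}
PAD = "[PAD]"
LABELS = ["Sports", "Business", "National", "International"]


def clean(text):
    """Lowercase, drop punctuation and digits, split, remove stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def build_vocab(token_lists):
    """Map each token to an integer index; index 0 is reserved for [PAD]."""
    vocab = {PAD: 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to indices and pad with the [PAD] index up to max_len."""
    ids = [vocab[tok] for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


def one_hot(label, labels=LABELS):
    """One-hot encode a category label."""
    return [1 if label == name else 0 for name in labels]


docs = ["The team won 3 matches!", "Markets fall."]
tokens = [clean(d) for d in docs]
vocab = build_vocab(tokens)
max_len = max(len(t) for t in tokens)      # longest article in the corpus
encoded = [encode(t, vocab, max_len) for t in tokens]
print(encoded)            # [[1, 2, 3], [4, 5, 0]]
print(one_hot("Sports"))  # [1, 0, 0, 0]
```

The second sentence is shorter than the first, so its sequence ends with the reserved index 0.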
B. Text Representation
The dataset used in our study is small. Therefore, after the
pre-processing steps, the news is represented using pre-trained
word embeddings. Word embedding can capture the context of a word
in a document, its semantic and syntactic similarity, and its
relation to other words in the news. Each word in the vocabulary
is represented as a continuous vector of 300 dimensions obtained
from pre-trained word vectors trained on Wikipedia data using the
skip-gram model. The maximum length of a news article in our
training corpus is 150 tokens. Hence, each news article in the
corpus is represented as a word embedding matrix of size 150 × 300,
which is then fed into the network.
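The mapping from an index sequence to a 150 × 300 matrix can be sketched with NumPy. The sketch assumes that the pre-trained vectors are available as a dictionary from word to vector (here a hypothetical `pretrained` dict with a single made-up entry) and that words missing from it receive small random vectors; the paper does not state how out-of-vocabulary words are handled, so that choice is an assumption.

```python
import numpy as np

EMBED_DIM = 300   # dimension of the pre-trained vectors
MAX_LEN = 150     # maximum news length in the training corpus


def build_embedding_matrix(vocab, pretrained, dim=EMBED_DIM, seed=0):
    """Row i holds the vector of the word with index i; row 0 ([PAD]) is zero."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for word, idx in vocab.items():
        if idx == 0:
            continue  # keep the padding row at zero
        vec = pretrained.get(word)
        # Assumption: unseen words get a small random vector.
        matrix[idx] = vec if vec is not None else rng.normal(0.0, 0.1, dim)
    return matrix


def embed(ids, matrix):
    """Look up every index of a padded sequence."""
    return matrix[np.asarray(ids)]


# Hypothetical vocabulary and pre-trained vectors, for illustration only.
vocab = {"[PAD]": 0, "team": 1, "market": 2}
pretrained = {"team": np.ones(EMBED_DIM, dtype=np.float32)}

matrix = build_embedding_matrix(vocab, pretrained)
ids = [1, 2] + [0] * (MAX_LEN - 2)     # a two-word article padded to 150
news_matrix = embed(ids, matrix)
print(news_matrix.shape)               # (150, 300)
```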
C. Classification of news using CNN
Recently, deep learning has gained much popularity for text
classification. Popular deep learning classification techniques
include CNN, gated recurrent unit (GRU) [18], LSTM [19], and
random weighted LSTM (RWL) [20]. To the best of our knowledge, CNN
has not yet been explored for classifying English news from
Bangladeshi news portals. Therefore, CNN is used in this study for
text feature extraction and news classification. CNN is a
feed-forward network model based on an artificial neural network,
consisting of an input layer, hidden layers, and an output layer.
The hidden layers of a CNN are divided into convolution layers and
pooling layers, which learn and extract image or text features.
The output layer is a fully connected layer that performs the
classification.
The CNN model used in this study consists of an initial embedding
layer that maps the input news into a matrix, followed by a
convolution layer of 32 filters, each with a kernel of size 3, and
a ReLU activation layer. The convolution layer, or feature
extractor layer, performs the convolution operation by calculating
the dot product between the kernel and the receptive field of the
input matrix. The process repeats until the whole matrix is
traversed, and the output is passed to the ReLU activation layer.
The activation layer introduces non-linearity into the model. We
used the ReLU activation function since it makes the training
process of the model easier and improves the generalization
performance compared to other activation functions. After the
activation layer, we applied a max-pooling layer to downsample the
feature maps, followed by a flatten layer. Finally, to perform the
classification, a dense layer of size 4, matching the number of
news classes, is appended to the network with a softmax function.
The softmax function determines the output probability
distribution over the four news classes. The architecture of the
network is visualized in Fig. 2.
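To make the layer sequence concrete, the following NumPy sketch runs one forward pass through a network with the stated shapes: a 150 × 300 input, 32 convolution filters with kernel size 3, ReLU, max pooling, flatten, and a 4-unit softmax layer. It is an illustration under assumptions, not the trained model: the weights are random, the pooling size of 2 is not given in the paper, and the function names are chosen for the example.

```python
import numpy as np


def conv1d_relu(x, kernels, bias):
    """x: (seq_len, emb_dim); kernels: (n_filters, k, emb_dim)."""
    n_filters, k, _ = kernels.shape
    steps = x.shape[0] - k + 1
    out = np.empty((steps, n_filters))
    for t in range(steps):
        window = x[t:t + k]  # receptive field of shape (k, emb_dim)
        # Dot product between every kernel and the receptive field.
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU


def max_pool1d(x, size):
    """Keep the maximum of every block of `size` consecutive positions."""
    steps = x.shape[0] // size
    return x[:steps * size].reshape(steps, size, x.shape[1]).max(axis=1)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def forward(x, params, pool_size=2):
    h = conv1d_relu(x, params["kernels"], params["conv_bias"])  # (148, 32)
    h = max_pool1d(h, pool_size)                                # (74, 32)
    h = h.reshape(-1)                                           # flatten
    return softmax(params["dense_w"] @ h + params["dense_b"])  # (4,)


rng = np.random.default_rng(0)
params = {
    "kernels": rng.normal(0.0, 0.01, (32, 3, 300)),
    "conv_bias": np.zeros(32),
    "dense_w": rng.normal(0.0, 0.01, (4, 74 * 32)),  # assumes pool_size=2
    "dense_b": np.zeros(4),
}
x = rng.normal(0.0, 1.0, (150, 300))  # stand-in for an embedded news article
probs = forward(x, params)
print(probs.shape, probs.sum())
```

With a kernel of size 3 and no padding, the 150 input positions yield 148 convolution outputs per filter, which the assumed pooling size of 2 halves to 74.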
The network is trained for 20 epochs with a batch size of 16 using
the Adam optimization method to minimize the categorical
cross-entropy loss shown in Eq. 1.
$$L(\theta) = -\frac{1}{C} \sum_{i=1}^{C} y_i \log(\hat{y}_i) \tag{1}$$
where $C$ is the number of target classes, $y$ is the one-hot
representation of the ground truth and $\hat{y}$ is the estimated
probability distribution assigned to the news classes by the
model.
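As a small worked example of Eq. 1, the loss for a single news article can be computed as below. The function name, the clipping constant `eps` and the sample probabilities are assumptions made for illustration; the 1/C factor follows Eq. 1 as written, whereas many library implementations omit it.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Eq. 1: -(1/C) * sum_i y_i * log(y_hat_i) for one sample."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) when a predicted probability is exactly zero.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    return -np.sum(y_true * np.log(y_prob)) / y_true.size


# One-hot ground truth (third class) and a hypothetical softmax output.
y = [0, 0, 1, 0]
y_hat = [0.1, 0.1, 0.7, 0.1]
loss = categorical_cross_entropy(y, y_hat)
print(loss)  # equals -log(0.7) / 4
```

Only the term of the true class contributes, so the loss shrinks toward zero as the probability assigned to that class approaches 1.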
V. EXPERIMENTS AND RESULTS
In this study, a novel method is proposed for the classification
of English news into four groups (Sports, National, International
and Business) from the news title and news body using pre-trained
word embeddings and a convolutional neural network. The proposed
method is compared with six traditional machine learning
algorithms, namely random forest (RF) [21], adaptive boosting
(AdaBoost) [22], gradient boosting tree (GBT) [23], decision tree
(DT) [24], support vector machine (SVM) [25] and k-nearest
neighbour (kNN) [26], which serve as the baseline methods.
Fig. 2: Architecture of the CNN model to classify news categories
In the baseline methods, we obtain the Bag-of-Words (BoW)
representation of each news article and calculate the term
frequency-inverse document frequency (TF-IDF) for each term in the
BoW representation. Next, the most significant terms for
classifying news into the four groups are selected using the
Analysis of Variance (ANOVA) hypothesis testing method. The
hypothesis test is performed on the TF-IDF of each term (feature)
across the four news categories. Based on the test statistics, we
select the 647 terms whose p-values are less than the significance
level of 0.01 and train the machine learning classifiers on them.
For both the proposed and the baseline methods, we divide the
dataset into training and test sets with a ratio of 8:2. We train
the models on the training set and measure the performance on the
test set.
Table II shows the results of the experiments. We use macro and
micro f1-scores as evaluation measures in this paper. The macro
f1-score is the arithmetic mean of the per-class f1-scores, while
the micro f1-score is computed by combining micro-precision and
micro-recall over all the samples. According to the results, our
proposed method performs better than all the baseline methods in
terms of both macro f1-score and micro f1-score, which are 0.90
and 0.89, respectively, compared with at most 0.86 for the
baselines. The main reason behind the better performance of the
proposed method is that we use pre-trained word embeddings for
text representation, which can represent the semantic and
syntactic relationships among words. On the other hand, the BoW
representation used in the baseline methods generates sparse
vectors when data is limited. Therefore, the classifiers cannot
properly learn the non-linear relationship between the news and
its category. Consequently, the baseline methods show lower
performance than the proposed method.
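The difference between the two averaging schemes can be seen in a short pure-Python sketch. The function name and the six sample labels are assumptions made for illustration and are not taken from the dataset.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per-class f1, macro f1, micro f1) for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Macro: unweighted mean of the per-class scores.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool the counts of all classes before computing f1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro


# Hypothetical labels: S = Sports, N = National, I = International, B = Business.
y_true = ["S", "S", "N", "N", "I", "B"]
y_pred = ["S", "N", "N", "N", "I", "B"]
per_class, macro, micro = f1_scores(y_true, y_pred)
print(per_class)
print(macro, micro)
```

Here one Sports article is misclassified as National, so the micro f1-score equals the accuracy (5/6), while the macro f1-score averages the four per-class scores and weights every class equally regardless of its size.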
TABLE II. News Classification Result Measured by Precision, Recall and F-score
Method          Macro Pre.  Micro Pre.  Macro Rec.  Micro Rec.  Macro f1  Micro f1
Proposed        0.91        0.89        0.89        0.89        0.90      0.89
RF+BoW          0.88        0.86        0.85        0.85        0.86      0.85
AdaBoost+BoW    0.69        0.65        0.64        0.62        0.65      0.63
GBT+BoW         0.75        0.73        0.68        0.69        0.70      0.70
DT+BoW          0.68        0.67        0.65        0.66        0.66      0.66
SVM+BoW         0.89        0.87        0.84        0.85        0.86      0.86
kNN+BoW         0.87        0.85        0.86        0.85        0.86      0.85
The class-wise precision, recall and f1-score of the proposed
method are shown in Table III.
TABLE III. Precision, recall and f1-score for each class obtained using proposed method
Class           Precision  Recall  F-score
Sports          1.00       0.85    0.92
National        0.81       0.89    0.85
International   0.93       0.95    0.94
Business        0.90       0.87    0.89
As Table III shows, the proposed method achieves the highest
precision (1.00) for the Sports category and the highest F-score
(0.94) for the International category, while it exhibits the
lowest performance for the National category (F-score of 0.85).
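As a consistency check, the macro f1-score of the proposed method reported in Table II can be recovered from the per-class F-scores in Table III, since the macro f1-score is the arithmetic mean of the per-class scores:

```latex
\text{macro-}f_1 = \frac{0.92 + 0.85 + 0.94 + 0.89}{4} = \frac{3.60}{4} = 0.90
```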
For a better visual representation, we also show the class-wise
ROC curves of the proposed method in Fig. 3. From the figure, we
can see that the area under the curve (AUC) is high for all
classes, which demonstrates the effectiveness of our proposed
method in news classification.
Fig. 3: Class-wise receiver operating characteristics curve of
proposed method
VI. CONCLUSION
In this study, a method is proposed for news classification from
the news title and body using word embedding and a CNN model. To
measure the performance of the method, a dataset is prepared
consisting of four different categories of news. The results show
that the proposed method outperforms the baseline methods, with a
macro-f1 of 0.90 against at most 0.86 for the baselines. The
proposed method can be used in online news portals for contextual
advertisements. In the future, we plan to increase the size of our
dataset and add more news categories. Moreover, extending the
proposed method to Bangla news can be investigated.
REFERENCES
[1] G. Krishnalal, S. B. Rengarajan, and K. Srinivasagan, “A new text
mining approach based on hmm-svm for web news classification,”
International Journal of Computer Applications, vol. 1, no. 19, pp. 98–
104, 2010.
[2] E. M. Khan, M. S. H. Mukta, M. E. Ali, and J. Mahmud, “Predicting
users’ movie preference and rating behavior from personality and
values,” ACM Transactions on Interactive Intelligent Systems (TiiS),
vol. 10, no. 3, pp. 1–25, 2020.
[3] M. M. Rahman, M. T. H. Majumder, M. S. H. Mukta, M. E. Ali,
and J. Mahmud, “Can we predict eat-out preference of a person from
tweets?,” in Proceedings of the 8th ACM Conference on Web Science,
pp. 350–351, 2016.
[4] M. S. H. Mukta, A. S. Sakib, M. A. Islam, M. E. Ali, M. Ahmed, and
M. A. Rifat, “Friends’ influence driven users’ value change prediction
from social media usage,” SBP-BRiMS, 2021.
[5] G. Septian, A. Susanto, and G. F. Shidik, “Indonesian news classification
based on nabana,” in 2017 International Seminar on Application for
Technology of Information and Communication (iSemantic), pp. 175–
180, IEEE, 2017.
[6] S. Kaur and N. K. Khiva, “Online news classification using deep
learning technique,” International Research Journal of Engineering and
Technology (IRJET), vol. 3, no. 10, pp. 558–563, 2016.
[7] D. Y. Liliana, A. Hardianto, and M. Ridok, “Indonesian news clas-
sification using support vector machine,” World Academy of Science,
Engineering and Technology, vol. 57, pp. 767–770, 2011.
[8] I. Dilrukshi, K. De Zoysa, and A. Caldera, “Twitter news classification
using svm,” in 2013 8th International Conference on Computer Science
& Education, pp. 287–291, IEEE, 2013.
[9] P. Kim, “Convolutional neural network,” in MATLAB Deep Learning,
pp. 121–147, Springer, 2017.
[10] U. Suleymanov, S. Rustamov, M. Zulfugarov, O. Orujov, N. Musayev,
and A. Alizade, “Empirical study of online news classification using
machine learning approaches,” in 2018 IEEE 12th International Confer-
ence on Application of Information and Communication Technologies
(AICT), pp. 1–6, IEEE, 2018.
[11] H.-T. Duong and V. T. Hoang, “A survey on the multiple classifier for
new benchmark dataset of vietnamese news classification,” in 2019 11th
International Conference on Knowledge and Smart Technology (KST),
pp. 23–28, IEEE, 2019.
[12] M. Islam, F. E. M. Jubayer, S. I. Ahmed, et al., “A comparative study
on different types of approaches to bengali document categorization,”
arXiv preprint arXiv:1701.08694, 2017.
[13] T. B. Shahi and A. K. Pant, “Nepali news classification using naïve
bayes, support vector machines and neural networks,” in 2018
International Conference on Communication Information and Computing
Technology (ICCICT), pp. 1–5, IEEE, 2018.
[14] B. Jang, I. Kim, and J. W. Kim, “Word2vec convolutional neural
networks for classification of news articles and tweets,” PloS one,
vol. 14, no. 8, p. e0220976, 2019.
[15] S. M. Hassan, F. Ali, S. Wasi, S. Javeed, I. Hussain, and S. N. Ashraf,
“Roman-urdu news headline classification with ir models using machine
learning algorithms,” Indian Journal of Science and Technology, vol. 12,
no. 35, pp. 1–9, 2019.
[16] M. P. Akhter, Z. Jiangbin, I. R. Naqvi, M. Abdelmajeed, A. Mehmood,
and M. T. Sadiq, “Document-level text classification using single-layer
multisize filters convolutional neural network,” IEEE Access, vol. 8,
pp. 42689–42707, 2020.
[17] M. A. Ramdhani, D. S. Maylawati, and T. Mantoro, “Indonesian news
classification using convolutional neural network,” Indonesian Journal of
Electrical Engineering and Computer Science, vol. 19, no. 2, pp. 1000–
1009, 2020.
[18] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
rnn encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.
[19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[20] A. S. Al Rafi, T. Rahman, A. R. Al Abir, T. A. Rajib, M. Islam, and
M. S. H. Mukta, “A new classification technique: random weighted lstm
(rwl),” in 2020 IEEE Region 10 Symposium (TENSYMP), pp. 262–265,
IEEE, 2020.
[21] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–
32, 2001.
[22] Y. Freund, R. Schapire, and N. Abe, “A short introduction to boosting,”
Journal-Japanese Society For Artificial Intelligence, vol. 14, no. 771-
780, p. 1612, 1999.
[23] J. H. Friedman, “Stochastic gradient boosting,” Computational statistics
& data analysis, vol. 38, no. 4, pp. 367–378, 2002.
[24] Y.-Y. Song and L. Ying, “Decision tree methods: applications for
classification and prediction,” Shanghai archives of psychiatry, vol. 27,
no. 2, p. 130, 2015.
[25] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning,
vol. 20, no. 3, pp. 273–297, 1995.
[26] L. E. Peterson, “K-nearest neighbor,” Scholarpedia, vol. 4, no. 2,
p. 1883, 2009.
613
... The other similar examples can be found in [27,28]. Furthermore, the study [29] discusses the problem of online news websites wasting money on advertising by showing readers ads that are of little interest to them. Using a mix of word embedding and Convolutional Neural Network (CNN), the authors suggest a method for categorizing English news into four distinct groups. ...
Article
Full-text available
The study focuses on news category prediction and investigates the performance of sentence embedding of four transformer models (BERT, RoBERTa, MPNet, and T5) and their variants as feature vectors when combined with Softmax and Random Forest using two accessible news datasets from Kaggle. The data are stratified into train and test sets to ensure equal representation of each category. Word embeddings are generated using transformer models, with the last hidden layer selected as the embedding. Mean pooling calculates a single vector representation called sentence embedding, capturing the overall meaning of the news article. The performance of Softmax and Random Forest, as well as the soft voting of both, is evaluated using evaluation measures such as accuracy, F1 score, precision, and recall. The study also contributes by evaluating the performance of Softmax and Random Forest individually. The macro-average F1 score is calculated to compare the performance of different transformer embeddings in the same experimental settings. The experiments reveal that MPNet versions v1 and v3 achieve the highest F1 score of 97.7% when combined with Random Forest, while T5 Large embedding achieves the highest F1 score of 98.2% when used with Softmax regression. MPNet v1 performs exceptionally well when used in the voting classifier, obtaining an impressive F1 score of 98.6%. In conclusion, the experiments validate the superiority of certain transformer models, such as MPNet v1, MPNet v3, and DistilRoBERTa, when used to calculate sentence embeddings within the Random Forest framework. The results also highlight the promising performance of T5 Large and RoBERTa Large in voting of Softmax regression and Random Forest. The voting classifier, employing transformer embeddings and ensemble learning techniques, consistently outperforms other baselines and individual algorithms. 
These findings emphasize the effectiveness of the voting classifier with transformer embeddings in achieving accurate and reliable predictions for news category classification tasks.
... An explainable artificial intelligent technique for tomato leaf disease diagnosis using the belief rule-based expert system [12,14,5,15,13,22,23,17,4,7] will be investigated in the future. We will also implement Federated learning of Convolutional Neural Networks [3,8,16,6,25,19] for multi-institutional collaboration for tomato leaf disease diagnosis. ...
Chapter
Full-text available
Tomato leaves can be infected with various infectious viruses and fungal diseases that drastically reduce tomato production and incur a great economic loss. Therefore, tomato leaf disease detection and identification are crucial for maintaining the global demand for tomatoes for a large population. This paper proposes a machine learning-based technique to identify diseases on tomato leaves and classify them into three diseases (Septoria, Yellow Curl Leaf, and Late Blight) and one healthy class. The proposed method extracts radiomics-based features from tomato leaf images and identifies the disease with a gradient boosting classifier. The dataset used in this study consists of 4000 tomato leaf disease images collected from the Plant Village dataset. The experimental results demonstrate the effectiveness and applicability of our proposed method for tomato leaf disease detection and classification.KeywordsTomato leaf diseaseMachine learningRadiomics featuresClassification
... Researchers prefer the development of CNN-based architecture to provide accurate medical diagnoses in the present day. This study developed an automated skin cancer classifier using a shallow CNN model [22]. The model structure was consciously designed to reduce the computational cost and complexity required to accurately extract image features and classify those. ...
Conference Paper
Skin cancer is a fatal disease that has become the leading cause of death worldwide in recent years, although it is curable if diagnosed early. Early skin cancer detection significantly improves patients' chances of survival and reduces mortality. In this research, we conduct experiments on a high imbalance dermoscopic ISIC 2020 dataset. The primary objective of this study is to develop a shallow CNN architecture to complete the classification task effectively, requiring fewer computational resources without compromising accuracy. We have used pre-processing techniques to remove image noise and truncation and augmentation techniques to balance the dataset before fitting it into the model. Multiple performance measurement metrics were utilized to establish the overall performance. Our proposed model yields a remarkable test accuracy of 98.81%. We compare our models' performance with different transfer learning (TL) models to assess the faster convergence rate. The proposed model demonstrated its robustness by outperforming the other TL models in terms of accuracy within a short processing time. It is reasonable to assume that our proposed system will reliably aid dermatologists in diagnosing skin cancer patients early and increasing survival rates.
... Researchers prefer the development of CNN-based architecture to provide accurate medical diagnoses in the present day. This study developed an automated skin cancer classifier using a shallow CNN model [22]. The model structure was consciously designed to reduce the computational cost and complexity required to accurately extract image features and classify those. ...
Conference Paper
Full-text available
Skin cancer is a fatal disease that has become the leading cause of death worldwide in recent years, although it is curable if diagnosed early. Early skin cancer detection significantly improves patients' chances of survival and reduces mortality. In this research, we conduct experiments on a high imbalance dermoscopic ISIC 2020 dataset. The primary objective of this study is to develop a shallow CNN architecture to complete the classification task effectively, requiring fewer computational resources without compromising accuracy. We have used pre-processing techniques to remove image noise and truncation and augmentation techniques to balance the dataset before fitting it into the model. Multiple performance measurement metrics were utilized to establish the overall performance. Our proposed model yields a remarkable test accuracy of 98.81%. We compare our models' performance with different transfer learning (TL) models to assess the faster convergence rate. The proposed model demonstrated its robustness by outperforming the other TL models in terms of accuracy within a short processing time. It is reasonable to assume that our proposed system will reliably aid dermatologists in diagnosing skin cancer patients early and increasing survival rates.
Article
Full-text available
Traditional Machine Learning (ML) models are generally preferred for classification tasks on tabular datasets, which often produce unsatisfactory results in complex tabular datasets. Recent works, using Convolutional Neural Networks (CNN) with embedding techniques, outperform the traditional classifiers on tabular dataset. However, these embedding techniques fail to use an automated approach after analyzing the importance of the features in the dataset accurately. This study introduces a novel feature embedding technique named Dynamic Weighted Tabular Method (DWTM), which dynamically uses feature weights based on their strength of the correlations to the class labels during applying any CNN architectures on the tabular datasets. DWTM converts each data point into images and then feeds to a CNN architecture. It dynamically embeds the features of the tabular dataset based on their strength and assigns pixel positions to the appropriate features in the image canvas space instead of using any static configuration. In this paper, DWTM embedding method is applied over six benchmark tabular datasets independently by using three different CNN architectures (i.e., ResNet-18, DenseNet and InceptionV1) and an outstanding performance (an average accuracy of 98%) has obtained, which outperforms any traditional and CNN based classifiers as well.
Conference Paper
Basic human values represent a set of values such as security, independence, success, kindness, and pleasure, which we deem important to our lives. The value priority of a person may change over time due to different factors such as life experiences, influence, social structure and technology. In this study, we show that we can predict the value change of a person by considering both the influence of her friends and her social media usage. This is the first work in the literature that relates the influence of social media friends to the human value dynamics of a user. We propose a Bounded Confidence Model (BCM) based value dynamics model, built from 275 different ego networks in Facebook, that predicts how social influence may persuade a person to change her values over time. Then, to improve prediction, we use a particle swarm optimization based hyperparameter tuning technique. We observe that these optimized hyperparameters produce more accurate future value scores. We also run our approach with different machine learning based methods and find that the support vector regressor (SVR) outperforms other regression models. By using SVR with the best hyperparameters of the BCM model, we obtain the lowest Mean Squared Error (MSE) of 0.00347.
Article
Due to the unprecedented growth of digital devices and handheld smartphones, users generate several quintillion bytes of data every day through social media, blogs, YouTube, etc. With the advancement of machine learning techniques, we can classify these data automatically to support critical decision making. The majority of these classification algorithms are linear in nature, and they perform poorly in predicting class labels when the attributes are complex. For example, human behavior, preference, personality, etc. have numerous non-linear properties and are difficult to predict in real life using these traditional machine learning algorithms. In this paper, we propose a novel non-linear technique based on the Long Short-Term Memory (LSTM) architecture. Studies show that Recurrent Neural Network (RNN) and LSTM based models usually predict time-series and sequential data better than other models. We significantly change the operational mechanism of LSTM and achieve outstanding performance on classification problems. We run our algorithm on six different datasets: Iris, Pima Indian, Breast Cancer, Blood Transfusion, StackOverflow, and Banknote Authentication. We compare the performance of our algorithm with other traditional classifiers. Our classifier generally outperforms conventional linear and non-linear classifiers.
Article
In this article, we propose novel techniques to predict a user's movie genre preference and rating behavior from her psycholinguistic attributes obtained from her social media interactions. The motivation for this work comes from various psychological studies demonstrating that psychological attributes such as personality and values can influence one's decisions or choices in real life. In this work, we integrate user interactions in Twitter and IMDb to derive interesting relations between human psychological attributes and movie preferences. In particular, we first predict a user's movie genre preferences from the personality and value scores of the user derived from her tweets. Second, we develop models to predict user movie rating behavior from her tweets in Twitter and her movie genre and storyline preferences from IMDb. We further strengthen the movie rating model by incorporating user reviews. In the above models, we investigate the role of personality and values both independently and in combination while predicting movie genre preferences and movie rating behaviors. We find that our combined models significantly improve accuracy over a single model built using personality or values independently. We also compare our technique with traditional movie genre and rating prediction techniques. The experimental results show that our models are effective in recommending movies to users.
Article
Every language has unique characteristics, structures, and grammar. Thus, different languages call for different processing and produce different results in the Natural Language Processing (NLP) research area. In current NLP research, Data Mining (DM) and Machine Learning (ML) techniques are popular, especially Deep Learning (DL) methods. This research aims to classify text data in the Indonesian language using a Convolutional Neural Network (CNN) as one of the DL algorithms. The CNN algorithm used is modified to follow the characteristics of the Indonesian language. Accordingly, in the text pre-processing phase, stopword removal and stemming are tailored to the Indonesian language. The experiment was conducted using 472 Indonesian news texts from various sources in four categories: ‘hiburan’ (entertainment), ‘olahraga’ (sport), ‘tajuk utama’ (headline news), and ‘teknologi’ (technology). In the experiment and evaluation, which used 377 training items and 95 testing items and produced five models with ten epochs each, the CNN achieved its best accuracy of around 90.74% and a loss value of around 29.05% for 300 hidden layers in classifying the Indonesian news data.
Article
The rapid growth of electronic documents is causing problems such as unstructured data, which require more time and effort to search for a relevant document. Text Document Classification (TDC) has great significance in information processing and retrieval, where unstructured documents are organized into pre-defined classes. Urdu is the most favored research language among South Asian languages because of its complex morphology, unique features, and lack of linguistic resources such as standard datasets. Compared with short-text tasks such as sentiment analysis, long text classification needs more time and effort because of its large vocabulary, greater noise, and redundant information. Machine Learning (ML) and Deep Learning (DL) models have been widely used in text processing. Despite the major limitations of ML models, such as learning directed features, these are the favored methods for Urdu TDC. To the best of our knowledge, this is the first study of Urdu TDC using a DL model. In this paper, we design a large multi-purpose and multi-format dataset that contains more than ten thousand documents organized into six classes. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for classification and compare its performance with sixteen ML baseline models on three imbalanced datasets of various sizes. Further, we analyze the effects of preprocessing methods on SMFCNN performance. SMFCNN outperformed the baseline classifiers and achieved accuracy scores of 95.4%, 91.8%, and 93.3% on the medium, large, and small datasets, respectively. The designed dataset will be publicly and freely available in different formats for future research in Urdu text processing.
Article
Objectives: Roman-Urdu is considered a non-standard language that is used frequently on the Internet. Classifying Roman-Urdu text for article tagging is a difficult task because of many irregularities in spelling; for example, the word khubsurat (beautiful) in Roman-Urdu has multiple spellings. It can also be written as khoobsurat, khubsoorat, and khobsorat. Methods/Statistical Analysis: In this study, we scrape Roman-Urdu news headlines from various online newspapers. Our corpus contains 12319 news headlines in seven categories, i.e., Accident, Sports, Weather, Arrest, Conference, Operation, and Violence. We also use different preprocessing approaches such as Roman-Urdu stop-word removal and apply IR models, i.e., TF-IDF and Count Vector, for feature extraction before applying classifier algorithms. Findings: We compare results across different Machine Learning algorithms such as RF, LSVC, MNB, LR, RC, PAC, Perceptron, NC, and SGDC. Our model identifies the desired class best with the SGD classifier, which gives 93.50% accuracy. Application/Improvements: It is recommended that SGD classifiers be used for Roman-Urdu news headline text classification. Keywords: Linear SVC, Multinomial Naïve Bayes (MNB), Ridge Classifier (RC), Random Forest, Roman-Urdu, Supervised Machine Learning, Stochastic Gradient Descent (SGD), Text Classification, Tf-Id
Article
Big web data from sources including online news and Twitter are good resources for investigating deep learning. However, collected news articles and tweets almost certainly contain data unnecessary for learning, and this disturbs accurate learning. This paper explores the performance of word2vec Convolutional Neural Networks (CNNs) in classifying news articles and tweets into related and unrelated ones. Using the two word embedding algorithms of word2vec, Continuous Bag-of-Words (CBOW) and Skip-gram, we constructed a CNN with the CBOW model and a CNN with the Skip-gram model. We measured the classification accuracy of the CNN with CBOW, the CNN with Skip-gram, and the CNN without word2vec on real news articles and tweets. The experimental results indicated that word2vec significantly improved the accuracy of the classification model. The accuracy of the CBOW model was higher and more stable than that of the Skip-gram model. The CBOW model exhibited better performance on news articles, and the Skip-gram model exhibited better performance on tweets. Overall, the CNN with word2vec models was more effective on news articles than on tweets because news articles are typically more uniform than tweets.
Conference Paper
In this paper, we summarize the well-known multi-class classifiers in the literature and evaluate them on a new benchmark dataset of Vietnamese news (VNNews-01). This dataset is created from more than thirty Vietnamese online newspaper websites and grouped into twenty-five categories. The dataset and evaluation presented in this work may promote related research on Vietnamese text mining.
Conference Paper
Automated news classification is the task of categorizing news into predefined categories based on content, with the confidence learned from a training news dataset. This research evaluates some of the most widely used machine learning techniques, mainly Naive Bayes, SVM and Neural Networks, for the automatic Nepali news classification problem. To evaluate the system, a self-created Nepali News Corpus with 20 different categories and a total of 4964 documents, collected by crawling different online national news portals, is used. TF-IDF based features are extracted from the preprocessed documents to train and test the models. The average empirical results show that the SVM with RBF kernel outperforms the other three algorithms with a classification accuracy of 74.65%, followed by the linear SVM with an accuracy of 74.62%, Multilayer Perceptron Neural Networks with an accuracy of 72.99%, and Naive Bayes with an accuracy of 68.31%.