Word Embedding based News Classification by using CNN
Faisal Ahmed
Department of CSE
Premier University
Chattogram, Bangladesh
faisalcsecubd@gmail.com
Nazma Akther
Department of CSE
United International University (UIU)
Dhaka, Bangladesh
nazmacse2013@gmail.com
Mohammad Hasan
Department of CSE
Premier University
Chattogram, Bangladesh
mehedih256@gmail.com
Kibtia Chowdhury
Department of CSE
United International University (UIU)
Dhaka, Bangladesh
kchowdhury211056@mscse.uiu.ac.bd
Md. Saddam Hossain Mukta
Department of CSE
United International University (UIU)
Dhaka, Bangladesh
saddam@cse.uiu.ac.bd
Abstract—In this era of information technology, the number of
online news portals is increasing day by day. These portals earn
substantial revenue by advertising consumer products to their
readers. However, traditional news portals lack the intelligence
to identify which types of news their users prefer. As a
consequence, they often show irrelevant advertisements to readers,
which causes economic loss to the advertisers. If a news portal can
identify the type of news a user is reading, it can provide
contextual advertisements (advertisements of products related to
that news) and gain more profit. Therefore, in this paper, we
propose a method that integrates word embedding with a
Convolutional Neural Network (CNN) to classify English news into
four categories: Sports, Business, National and International. The
performance of the proposed method is evaluated on our prepared
dataset in terms of macro-f1 and micro-f1 scores. The experimental
results show that the proposed method achieves macro-f1 and
micro-f1 scores of 0.90 and 0.89, respectively, which are higher
than those of all the baseline methods.
Keywords—News Classification, Word Embedding, CNN, BoW,
Contextual Marketing, Machine learning
I. INTRODUCTION
With the advancement of information technology, a massive
amount of information is now stored in electronic form. Since
vast amounts of news are available on the Internet, finding the
news of interest has become a time-consuming process. In the
financial sector, news events are a critical factor that moves the
financial market positively or negatively. Therefore,
classification of news is a crucial step in allowing users to reach
their news of interest quickly and effectively. Besides, news
articles have become important for company managers, policy makers
and investors in making better decisions. People usually read news
to acquire information about their preferred areas from all over
the world. However, news classification is a challenging task in
the field of text mining because of the large volume of news
available in digital media. It is very difficult for editors to
derive structured information from unstructured news data.
To develop automated news classification systems, many
researchers have dedicated their work to classifying news content.
Many techniques have been applied to news corpora to classify
textual data into categories [1]. In addition, many researchers
work with social media textual data to predict the movie
preferences of viewers [2]. Social media data is also used to
classify people’s choices based on their behavior and values [3],
[4]. Moreover, various machine learning and deep learning models,
including Naïve Bayes [5], neural networks [6] and SVM [7], [8],
have been implemented to classify news categories. In this study,
we propose a novel method to classify news into multiple
categories, namely Business, Sports, National and International,
collected from Bangladeshi news portals. The main contribution of
this study is a self-created dataset built by scraping different
online news portals.
To achieve our goal, we use word embedding with CNN [9]. The
motivation of our proposed method is to develop a content-based
news recommendation system that helps users select relevant news
from a huge number of articles. Besides, it can support contextual
advertising, which shows meaningful advertisements to the end
user. For example, it is more meaningful to show advertisements of
sports materials than of beauty products while someone is reading
sports news.
The organization of this paper is as follows. Section II
presents the literature review, Section III describes the data
collection process, in which data is collected from scratch by web
scraping, Section IV describes the proposed methodology, and
Section V presents our experimental results. Finally, Section VI
concludes the paper.
2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on
Computational Science and Information Management (ICSECS-ICOCSIM)
978-1-6654-1407-4/21/$31.00 ©2021 IEEE
DOI 10.1109/ICSECS52883.2021.00117
II. LITERATURE REVIEW
A significant number of studies have been conducted on
online news classification using different types of multiclass
classifiers, and deep learning approaches are nowadays also
becoming popular for text mining and news categorization.
Research has been carried out on news categorization for English
as well as for other languages. Krishnalal et al. [1] proposed an
intelligent web news classification system that uses a Hidden
Markov Model (HMM) and a Support Vector Machine (SVM) to classify
three categories: sports, finance and politics. They collected
data from five popular Indian newspapers and compared KNN, SVM and
HMM-SVM.
Liliana et al. [7] used a Support Vector Machine (SVM) to
categorize Indonesian news with an accuracy of 85%. Suleymanov et
al. [10] designed a text classification system based on Naïve
Bayes, SVM and an Artificial Neural Network for categorizing
Azerbaijani news articles. They built a new text corpus and used
term frequency-inverse document frequency (TF-IDF) to convert the
text to vectors. Duong and Hoang [11] surveyed popular multi-class
classifiers such as K-NN, Naïve Bayes, Logistic Regression,
Decision Tree, Random Forest, SVM, One-vs-One (OVO) and One-vs-All
(OVA), and applied them to a new benchmark dataset for Vietnamese
news categorization with 25 classes. TF-IDF was used for feature
extraction, and the best result was obtained with the OVA
classifier.
Besides news classification in other languages, some researchers
have worked on Bangla document classification. Islam et al. [12]
presented a comparative study of three supervised machine learning
algorithms, namely SVM, Naïve Bayes and Stochastic Gradient
Descent (SGD), for Bangla document categorization. They used the
Chi-square distribution and TF-IDF for feature selection and
explored these two techniques with the three algorithms. They
found that SVM with TF-IDF for feature selection gave the best
result. Similarly, Shahi and Pant [13] proposed a Nepali news
classification system using SVM, Naïve Bayes and Neural Networks,
also extracting features with TF-IDF.
Jang et al. [14] presented a Word2vec CNN classification model
to classify news articles and tweets. They implemented two word
embedding methods, CBOW (Continuous Bag-of-Words) and Skip-gram,
with a CNN. Their experimental results showed that CBOW with CNN
works better for news articles, whereas Skip-gram with CNN works
better for tweets. The authors of [15] classified Roman-Urdu news
headlines with an accuracy of 93.5% and observed that the SGD
algorithm classified the headlines better than the other machine
learning algorithms. Furthermore, other researchers have applied
different techniques to news classification and text mining:
Akhter et al. [16] used a Single-layer Multisize Filters
Convolutional Neural Network (SMFCNN) for document-level text
classification, and Ramdhani et al. [17] classified Indonesian
news using CNN.
III. DATA COLLECTION
Data used in this experiment are collected from several
popular Bangladeshi English-language daily newspapers, including
the Daily Star¹ and the Daily Sun². A Python-based web scraper is
built to gather the news titles and body contents, labeled with
the category name. We use BeautifulSoup, a Python library for
parsing HTML, to scrape the news contents from the websites.
Although newspaper websites use different web structures to
represent news articles, we have designed a generic algorithm so
that we can collect data from various newspaper sites. The pseudo
code of the web scraper used in this study is shown in Algorithm 1.
Algorithm 1 Pseudo code of the proposed news scraper
Create the URL with the date
Request the URL and save the page content
Create a soup from the page content using the BeautifulSoup library
From the soup, find all ‘a’ HTML tags
for each tag i in all ‘a’ tags do
    Find the ‘href’ of i
    Request the ‘href’ and save the page content
    Create a soup from the page content
    Find the category and append it to the all_categories list
    Find the title and append it to the all_titles list
    Find the body of the news in the soup
    Find all ‘p’ tags in the body
    for each p in the ‘p’ tags do
        Get the text from p and save it in a list
    end for
    Merge all the text in the list into a single paragraph
    Append the paragraph to the news_contents list
end for
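The parsing steps of Algorithm 1 can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' scraper: it uses the standard-library html.parser module as a stand-in for BeautifulSoup so that it runs without third-party packages, it parses an HTML string instead of requesting a live URL, and the names LinkAndParagraphParser, extract_article and sample_page are hypothetical.

```python
from html.parser import HTMLParser


class LinkAndParagraphParser(HTMLParser):
    """Collects the href of every <a> tag and the text of every <p> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_article(html):
    """Return (all hrefs, all <p> text merged into one paragraph)."""
    parser = LinkAndParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs, " ".join(parser.paragraphs)


# Made-up markup standing in for a downloaded news page.
sample_page = (
    '<html><body><a href="/sports/match-report">Match report</a>'
    "<p>The home side won.</p><p>The series is level.</p></body></html>"
)
links, body_text = extract_article(sample_page)
```

In a real scraper, each collected link would be requested in turn and the category and title would be read from site-specific elements, which differ between newspapers.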
All the news articles belong to four categories, and the
number of instances in each category is given in Table I.
TABLE I. Number of News in Different Categories
Category        Number of news
National        397
International   267
Sports          188
Business        447
IV. METHODOLOGY
Our proposed method consists of three steps: pre-processing,
text representation and classification using CNN, as shown in
Fig. 1.
¹ https://thedailystar.net
² https://www.daily-sun.com
Fig. 1: Block Diagram of Proposed Method
A. Pre-processing
In the pre-processing step, all the text in the news is
converted to lowercase. Then we remove punctuation marks, digits
and extra white spaces using regular expressions. Additionally,
stop words, i.e., the most common words in the English language
such as “the”, “a”, “on”, “is” and “all”, which do not carry
important meaning, are eliminated. We also perform lemmatization
to reduce the inflectional forms of a word to a common base form.
All these pre-processing tasks are performed using NLTK. A
dictionary of terms is then constructed from all the news in the
dataset. A special token [PAD] is added to the dictionary for
padding: index 0 is reserved for the [PAD] token and is not
assigned to any word. For CNN, all input texts need to be of a
fixed length. Therefore, each news item whose length is less than
the maximum length of news in the dataset is padded with the [PAD]
token. Each news item is then converted to a sequence of integers,
where each integer is the index of a token in the dictionary. The
ground truth of each news item is represented in one-hot encoded
form.
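The pre-processing pipeline described above can be sketched as follows. This is only an illustration under stated assumptions: the stop-word set is a tiny hand-written subset rather than NLTK's list, lemmatization is omitted to keep the sketch free of third-party packages, and the function names and the two sample sentences are hypothetical.

```python
import re

# Tiny illustrative stop-word subset (assumption); the paper uses NLTK's list.
STOP_WORDS = {"the", "a", "on", "is", "all"}
CATEGORIES = ["Sports", "Business", "National", "International"]


def clean(text):
    """Lowercase, drop punctuation and digits, split, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and digits become spaces
    tokens = text.split()                  # split() also collapses extra white space
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocab(token_lists):
    """Map each distinct token to an index; index 0 is reserved for [PAD]."""
    vocab = {"[PAD]": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to indices and pad with the [PAD] index up to max_len."""
    ids = [vocab[t] for t in tokens][:max_len]
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))


def one_hot(category):
    """One-hot encode the ground-truth category."""
    return [1 if c == category else 0 for c in CATEGORIES]


news = ["The team won 3 matches!", "Stocks fall."]
token_lists = [clean(n) for n in news]
vocab = build_vocab(token_lists)
max_len = max(len(t) for t in token_lists)
encoded = [encode(t, vocab, max_len) for t in token_lists]
```

Here the second, shorter item is padded with index 0 so that both sequences have the same length.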
B. Text Representation
The dataset used in our study is small in size. Therefore,
after the pre-processing steps, the news is represented using
pre-trained word embeddings. Word embedding can capture the
context of a word in a document, its semantic and syntactic
similarity, and its relation with other words in the news. Each
word in the vocabulary is represented as a continuous vector of
300 dimensions obtained from pre-trained word vectors trained on
Wikipedia data using the skip-gram model. The maximum length of
the news in our training corpus is 150 tokens. Hence, each news
item in the corpus is represented as a word embedding matrix of
size 150×300 and then fed into the network.
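The lookup that turns a padded index sequence into a 150×300 matrix can be sketched as below. The vectors here are random stand-ins for the pre-trained embeddings, the all-zero row for [PAD] is an assumption, and the names make_lookup_table and embed are hypothetical.

```python
import random

EMBED_DIM = 300  # embedding dimensionality stated in the text
MAX_LEN = 150    # maximum news length stated in the text


def make_lookup_table(vocab_size, dim=EMBED_DIM, seed=0):
    """Random stand-in for pre-trained vectors; row 0 ([PAD]) is all zeros."""
    rng = random.Random(seed)
    table = [[0.0] * dim]
    for _ in range(vocab_size - 1):
        table.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return table


def embed(token_ids, table):
    """Replace each token index by its embedding vector."""
    return [table[i] for i in token_ids]


table = make_lookup_table(vocab_size=6)
token_ids = [1, 2, 3] + [0] * (MAX_LEN - 3)  # three real tokens, then padding
matrix = embed(token_ids, table)
```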
C. Classification of news using CNN
Recently, deep learning has gained much popularity for text
classification. Popular deep learning classification techniques
include CNN, gated recurrent unit (GRU) [18], LSTM [19] and random
weighted LSTM (RWL) [20]. To the best of our knowledge, CNN has
not yet been explored for classifying English news from
Bangladeshi news portals. Therefore, CNN is used in this study for
text feature extraction and news classification. CNN is a
feed-forward neural network consisting of an input layer, hidden
layers and an output layer. The hidden layers of a CNN comprise
convolution layers and pooling layers, which learn and extract
image or text features. The output layer is a fully connected
layer that performs the classification.
The CNN model used in this study consists of an initial
embedding layer that maps the input news into a matrix, followed
by a convolution layer of 32 filters, each with a kernel of size
3, and a ReLU activation layer. The convolution layer, or feature
extractor layer, performs the convolution operation by calculating
the dot product between the kernel and the receptive field of the
input matrix. The process repeats until the whole matrix is
traversed, and the output is passed to the ReLU activation layer.
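The convolution step for a single filter can be sketched as follows. This is a toy illustration, not the trained model: the stride of 1 and the absence of padding are assumptions the text does not state, the kernel values are made up, and conv1d_relu is a hypothetical name. Under these assumptions a 150-token input and a kernel of size 3 give 150 - 3 + 1 = 148 outputs per filter.

```python
def conv1d_relu(matrix, kernel, bias=0.0):
    """Slide a (k x d) kernel over an (n x d) matrix with stride 1, no padding.

    Each output is the dot product between the kernel and the k rows it
    currently covers (the receptive field), passed through ReLU.
    """
    k = len(kernel)
    out = []
    for start in range(len(matrix) - k + 1):
        total = bias
        for row, kernel_row in zip(matrix[start:start + k], kernel):
            total += sum(x * w for x, w in zip(row, kernel_row))
        out.append(max(0.0, total))  # ReLU
    return out


# Toy input: 5 tokens with 2-dimensional embeddings, one kernel of size 3.
toy_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-2.0, 0.0], [0.0, 0.0]]
toy_kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
feature_map = conv1d_relu(toy_matrix, toy_kernel)
```

The second window sums to -1 before the activation, so ReLU clips it to 0.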
The activation layer introduces non-linearity into the model.
We used the ReLU activation function since it makes the training
process of the model easier and improves the generalization
performance compared to other activation functions. After the
activation layer, we applied a max-pooling layer to downsample the
feature maps, followed by a flatten layer. Finally, to perform the
classification, a dense layer of size 4, matching the number of
news classes, is appended to the network with a softmax function.
The softmax function determines the output probability
distribution over the four news classes. The architecture of the
network is visualized in Fig. 2.
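The pooling, flattening and softmax steps can be sketched as below. This is a simplified illustration: the pool size of 2 is an assumption, the dense layer's weights are omitted, and the feature maps and logits are made-up values rather than outputs of the trained network.

```python
import math


def max_pool1d(feature_map, pool_size=2):
    """Non-overlapping max pooling; pool_size=2 is an assumption."""
    return [max(feature_map[i:i + pool_size])
            for i in range(0, len(feature_map) - pool_size + 1, pool_size)]


def flatten(feature_maps):
    """Concatenate the pooled feature maps into one feature vector."""
    return [v for fmap in feature_maps for v in fmap]


def softmax(logits):
    """Turn logits into a probability distribution over the classes."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


pooled = [max_pool1d(fm) for fm in ([3.0, 0.0, 1.0, 2.0], [0.5, 0.5, 4.0, 1.0])]
features = flatten(pooled)
probs = softmax([2.0, 1.0, 0.0, 0.0])  # made-up logits for the four classes
```

The predicted class is the index of the largest probability, here the first class.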
The network is trained for 20 epochs with a batch size of
16 using the Adam optimization method to minimize the categorical
cross-entropy loss shown in Eq. 1.
$$L(\theta) = -\frac{1}{C} \sum_{i=1}^{C} y_i \log(\hat{y}_i) \tag{1}$$
where $C$ is the number of target classes, $y$ is the one-hot
representation of the ground truth and $\hat{y}$ is the estimated
probability distribution assigned to the news classes by the
model.
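As a worked example of Eq. 1 for a single news item, the loss can be computed as below. The prediction vector is made up for illustration, and the small eps guard against log(0) is an addition that is not part of the equation.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. 1 for one sample: -(1/C) * sum_i y_i * log(y_hat_i)."""
    c = len(y_true)
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred)) / c


# Ground truth is the second class; the model assigns it probability 0.7.
loss = categorical_cross_entropy([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1])
```

Because the ground truth is one-hot, only the true class contributes, so the loss equals -log(0.7)/4, about 0.089. Library implementations of categorical cross-entropy usually omit the 1/C factor, which rescales the loss without changing its minimizer.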
V. EXPERIMENTS AND RESULTS
In this study, a novel method is proposed for the classification
of English news into four groups (Sports, National, International
and Business) from the news title and news body using pre-trained
word embedding and a convolutional neural network. The proposed
method is compared with six traditional machine learning
algorithms as baseline methods: random forest (RF) [21], adaptive
boosting (AdaBoost) [22], gradient boosting tree (GBT) [23],
decision tree (DT) [24], support vector machine (SVM) [25] and
k-nearest neighbour (kNN) [26].
Fig. 2: Architecture of the CNN model to classify news categories
In the baseline methods, we obtain the Bag-of-Words (BoW)
representation of each news item and calculate the term frequency
and inverse document frequency (TF-IDF) for each term in the BoW
representation. Next, the most significant terms for the
classification of news into the four groups are selected using the
Analysis of Variance (ANOVA) hypothesis testing method. The
hypothesis test is performed on the TF-IDF of each term (feature)
across the four news categories. Based on the test statistics, we
select the 647 terms whose p-values are less than the significance
level of 0.01 and train the machine learning classifiers on them.
For both the proposed and the baseline methods, we divide the
dataset into train and test sets with a ratio of 8:2. We train the
models on the training set and measure the performance on the test
set.
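The TF-IDF weighting and the 8:2 split used for the baselines can be sketched as follows. TF-IDF has several variants and the paper does not say which one is used; this sketch assumes term frequency normalized by document length and idf = log(N / df). The ANOVA selection step is omitted here (a library routine such as scikit-learn's f_classif could compute the per-term F-test), and the function names and toy documents are hypothetical.

```python
import math
import random
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def train_test_split(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of the items and cut it at the train ratio (8:2 here)."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


docs = [["goal", "match", "goal"], ["market", "stock"], ["match", "market"]]
weights = tf_idf(docs)
train, test = train_test_split(list(range(10)))
```

In this toy corpus "goal" occurs in only one of three documents, so it receives a larger weight than "match", which occurs in two.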
Table II shows the results of the experiments. We use macro
and micro f1-scores as evaluation measures in this paper. The
macro f1-score is the arithmetic mean of the per-class f1-scores,
while the micro f1-score is computed from the micro-precision and
micro-recall over all the samples. According to the results, our
proposed method achieves a macro f1-score of 0.90 and a micro
f1-score of 0.89, whereas the best baseline methods reach 0.86 on
both measures. The main reason behind the better performance of
the proposed method is that we use pre-trained word embeddings for
text representation, which can represent the semantic and
syntactic relationships among words. On the other hand, the BoW
representation used in the baseline methods generates sparse
vectors when data is limited. Therefore, the classifiers cannot
properly learn the non-linear relationship between the news and
its category. Consequently, the baseline methods show lower
performance than the proposed method.
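The difference between the two measures can be illustrated with a short sketch. The labels and predictions below are made-up toy data, not results from the paper, and macro_micro_f1 is a hypothetical helper.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def macro_micro_f1(y_true, y_pred, labels):
    """Return (macro f1, micro f1) for single-label multi-class predictions."""
    per_class, tp_sum, fp_sum, fn_sum = [], 0, 0, 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_class.append(f1(precision, recall))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro_p = tp_sum / (tp_sum + fp_sum)  # counts pooled over all classes
    micro_r = tp_sum / (tp_sum + fn_sum)
    return sum(per_class) / len(per_class), f1(micro_p, micro_r)


y_true = ["S", "S", "B", "B", "N", "I"]
y_pred = ["S", "B", "B", "B", "N", "N"]
macro, micro = macro_micro_f1(y_true, y_pred, ["S", "B", "N", "I"])
```

In this single-label setting, micro-precision, micro-recall and micro f1 all equal the accuracy (4 of 6 here), which is consistent with the identical micro columns of the proposed method in Table II. Likewise, the four per-class f1-scores reported in Table III (0.92, 0.85, 0.94 and 0.89) average to 0.90, the macro f1-score reported for the proposed method.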
TABLE II. News Classification Result Measured by Precision, Recall and F-score
Method          Macro Pre.  Micro Pre.  Macro Rec.  Micro Rec.  Macro f1  Micro f1
Proposed        0.91        0.89        0.89        0.89        0.90      0.89
RF+BoW          0.88        0.86        0.85        0.85        0.86      0.85
AdaBoost+BoW    0.69        0.65        0.64        0.62        0.65      0.63
GBT+BoW         0.75        0.73        0.68        0.69        0.70      0.70
DT+BoW          0.68        0.67        0.65        0.66        0.66      0.66
SVM+BoW         0.89        0.87        0.84        0.85        0.86      0.86
kNN+BoW         0.87        0.85        0.86        0.85        0.86      0.85
The class-wise precision, recall and f1-score of the proposed
method are shown in Table III.
TABLE III. Precision, recall and f1-score for each class obtained using proposed method
Category        Precision  Recall  F-score
Sports          1.00       0.85    0.92
National        0.81       0.89    0.85
International   0.93       0.95    0.94
Business        0.90       0.87    0.89
As we can see from Table III, the proposed method achieves the
highest precision (1.00) on the Sports category and the highest
f1-score (0.94) on the International category, and it exhibits
the lowest f1-score (0.85) on the National category of news.
For a better visual representation, we also show the class-wise
ROC curves of the proposed method in Fig. 3. From the figure, we
can see that the area under the curve (AUC) is high for all
classes, which demonstrates the effectiveness of our proposed
method in news classification.
Fig. 3: Class-wise receiver operating characteristic (ROC) curves of the proposed method
VI. CONCLUSION
In this study, a method is proposed for news classification
from the news title and body using word embedding and a CNN
model. To measure the performance of the method, a dataset
consisting of four different categories of news is prepared. The
results show that the proposed method outperforms all the baseline
methods, exceeding the best baseline by 0.04 in macro f1-score and
0.03 in micro f1-score. The proposed method can be used in online
news portals for contextual advertisements. In the future, we plan
to increase the size of our dataset and add more news categories.
Moreover, extending the proposed method to Bangla news can be
investigated.
REFERENCES
[1] G. Krishnalal, S. B. Rengarajan, and K. Srinivasagan, “A new text
mining approach based on hmm-svm for web news classification,”
International Journal of Computer Applications, vol. 1, no. 19, pp. 98–
104, 2010.
[2] E. M. Khan, M. S. H. Mukta, M. E. Ali, and J. Mahmud, “Predicting
users’ movie preference and rating behavior from personality and
values,” ACM Transactions on Interactive Intelligent Systems (TiiS),
vol. 10, no. 3, pp. 1–25, 2020.
[3] M. M. Rahman, M. T. H. Majumder, M. S. H. Mukta, M. E. Ali,
and J. Mahmud, “Can we predict eat-out preference of a person from
tweets?,” in Proceedings of the 8th ACM Conference on Web Science,
pp. 350–351, 2016.
[4] M. S. H. Mukta, A. S. Sakib, M. A. Islam, M. E. Ali, M. Ahmed, and
M. A. Rifat, “Friends’ influence driven users’ value change prediction
from social media usage,” SBP-BRiMS, 2021.
[5] G. Septian, A. Susanto, and G. F. Shidik, “Indonesian news classification
based on nabana,” in 2017 International Seminar on Application for
Technology of Information and Communication (iSemantic), pp. 175–
180, IEEE, 2017.
[6] S. Kaur and N. K. Khiva, “Online news classification using deep
learning technique,” International Research Journal of Engineering and
Technology (IRJET), vol. 3, no. 10, pp. 558–563, 2016.
[7] D. Y. Liliana, A. Hardianto, and M. Ridok, “Indonesian news clas-
sification using support vector machine,” World Academy of Science,
Engineering and Technology, vol. 57, pp. 767–770, 2011.
[8] I. Dilrukshi, K. De Zoysa, and A. Caldera, “Twitter news classification
using svm,” in 2013 8th International Conference on Computer Science
& Education, pp. 287–291, IEEE, 2013.
[9] P. Kim, “Convolutional neural network,” in MATLAB deep learning,
pp. 121–147, Springer, 2017.
[10] U. Suleymanov, S. Rustamov, M. Zulfugarov, O. Orujov, N. Musayev,
and A. Alizade, “Empirical study of online news classification using
machine learning approaches,” in 2018 IEEE 12th International Confer-
ence on Application of Information and Communication Technologies
(AICT), pp. 1–6, IEEE, 2018.
[11] H.-T. Duong and V. T. Hoang, “A survey on the multiple classifier for
new benchmark dataset of vietnamese news classification,” in 2019 11th
International Conference on Knowledge and Smart Technology (KST),
pp. 23–28, IEEE, 2019.
[12] M. Islam, F. E. M. Jubayer, S. I. Ahmed, et al., “A comparative study
on different types of approaches to bengali document categorization,”
arXiv preprint arXiv:1701.08694, 2017.
[13] T. B. Shahi and A. K. Pant, “Nepali news classification using naïve
bayes, support vector machines and neural networks,” in 2018
International Conference on Communication Information and Computing
Technology (ICCICT), pp. 1–5, IEEE, 2018.
[14] B. Jang, I. Kim, and J. W. Kim, “Word2vec convolutional neural
networks for classification of news articles and tweets,” PloS one,
vol. 14, no. 8, p. e0220976, 2019.
[15] S. M. Hassan, F. Ali, S. Wasi, S. Javeed, I. Hussain, and S. N. Ashraf,
“Roman-urdu news headline classification with ir models using machine
learning algorithms,” Indian Journal of Science and Technology, vol. 12,
no. 35, pp. 1–9, 2019.
[16] M. P. Akhter, Z. Jiangbin, I. R. Naqvi, M. Abdelmajeed, A. Mehmood,
and M. T. Sadiq, “Document-level text classification using single-layer
multisize filters convolutional neural network,” IEEE Access, vol. 8,
pp. 42689–42707, 2020.
[17] M. A. Ramdhani, D. S. Maylawati, and T. Mantoro, “Indonesian news
classification using convolutional neural network,” Indonesian Journal of
Electrical Engineering and Computer Science, vol. 19, no. 2, pp. 1000–
1009, 2020.
[18] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
rnn encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.
[19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[20] A. S. Al Rafi, T. Rahman, A. R. Al Abir, T. A. Rajib, M. Islam, and
M. S. H. Mukta, “A new classification technique: random weighted lstm
(rwl),” in 2020 IEEE Region 10 Symposium (TENSYMP), pp. 262–265,
IEEE, 2020.
[21] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–
32, 2001.
[22] Y. Freund, R. Schapire, and N. Abe, “A short introduction to boosting,”
Journal-Japanese Society For Artificial Intelligence, vol. 14, no. 771-
780, p. 1612, 1999.
[23] J. H. Friedman, “Stochastic gradient boosting,” Computational statistics
& data analysis, vol. 38, no. 4, pp. 367–378, 2002.
[24] Y.-Y. Song and L. Ying, “Decision tree methods: applications for
classification and prediction,” Shanghai archives of psychiatry, vol. 27,
no. 2, p. 130, 2015.
[25] C. Cortes, “Support-vector networks,” Machine Learning, vol. 20,
pp. 1–25, 1995.
[26] L. E. Peterson, “K-nearest neighbor,” Scholarpedia, vol. 4, no. 2,
p. 1883, 2009.