Different Word Representation For Text
Classification: A Comparative Study
Eman Alsagour
Computer Science Department
King Saud University
Riyadh, Saudi Arabia
eman.a.alsqour@gmail.com
Lubna Alhenki
Computer Science Department
King Saud University
Riyadh, Saudi Arabia
lubna.henaki@gmail.com
Mohammed Al-Dhelaan
Computer Science Department
King Saud University
Riyadh, Saudi Arabia
mdhelaan@ksu.edu.sa
Abstract—Documents typically contain large numbers of words, many of which contribute little to classification and can make it less accurate. Word representation methods have therefore been employed to address this issue. In this comparative study, we evaluate the effectiveness of word embedding and the TF-IDF weighting schema by applying four classifiers and measuring classification accuracy. The study was evaluated on the popular 20Newsgroup text document dataset. Our experiments show that using the TF-IDF method with an ANN classifier on the 20Newsgroup dataset substantially improves text document classification compared with word embedding and the other classifiers.
Index Terms—Neural network, TF-IDF, word embedding
I. INTRODUCTION
The use of data mining techniques in text documents has
quickly developed over the last few decades. In text
documents, the generated data is vast and too complex to be
analysed and processed using traditional methods. Thus, the
need for a feature extraction method in text data mining has
become essential [1]. Recently, various types of feature
extraction methods using several types of machine learning
classifiers have been applied by a number of researchers
[2],[3], and it has been demonstrated that it successfully
improves the accuracy of classification.
Words can be represented using Term Frequency-Inverse Document Frequency (TF-IDF) or word embedding. The TF-IDF weighting schema aims to estimate the importance of a keyword in a particular document (local) as well as its importance in the entire collection of relevant documents (global) [4], while word embedding aims to convert words into vectors using semantic word relationships [5]. However, providing accurate feature extraction for text documents remains a challenge.
Accurate text classification systems are primarily
motivated by the necessity of achieving maximum accuracy
when classifying the text. The main contribution of this
study is to compare the effectiveness of word embedding
and classical TF-IDF weighting schema by applying an
artificial neural network classifier along with three other
classifiers to assess the accuracy of the classification of the
20Newsgroup dataset; previous studies used only classical
classification and feature extraction on different text
datasets.
The remainder of this paper is organised as follows: the methodology is described in Section II, results and discussion are detailed in Section III, and conclusions and future work are discussed in Section IV.
II. METHODOLOGY DESIGN
The overall system workflow, which is divided into two
main phases, is illustrated in Fig. 1. The first phase aims at
data acquisition, and the second phase pre-processes text
documents; the second phase is the core of this work.
Fig. 1. System workflow.
A. Document Pre-Processing
The text in a corpus is first tokenized using a built-in tokenization function in Python; then the punctuation, special characters, and stop words are removed. Finally, the Porter algorithm is applied for word stemming.
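The pipeline above can be sketched as follows. This is a minimal illustration only: the regex tokenizer and the tiny stop-word list stand in for Python's built-in tools and a full stop-word list, and the Porter stemming step is omitted (a library stemmer would normally be applied to each surviving token).

```python
import re

# Illustrative stop-word list; a real pipeline uses a full list (e.g. NLTK's).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to"}

def preprocess(text):
    """Tokenize, strip punctuation/special characters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude word tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The graphics card is listed in the sci.med group!"))
# -> ['graphics', 'card', 'listed', 'sci', 'med', 'group']
```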
B. Feature Extraction
To study the discriminatory power in differentiating text
classes, tokens are converted into numerical values using
TF-IDF and word embedding.
a) TF-IDF
TF-IDF is the product of two statistical terms, as shown
in Equation (1). The first is the Term Frequency (TF), and
the second is the Inverse Document Frequency (IDF) [4]
W_t,d = tf_t,d · idf_t (1)

where W_t,d is the weight for feature t (i.e., a word) in document d, and tf_t,d and idf_t are defined in Equations (2) and (3), respectively [4]:

tf_t,d = (frequency of term t in document d) / (total number of terms in document d) (2)
978-1-7281-5052-9/19/$31.00 ©2019 IEEE
idf_t = log(N / df_t) (3)

where tf_t,d is the term frequency, which counts the number of times term t occurs in document d, and idf_t is the inverse document frequency, obtained by dividing the total number of documents, N, by the number of documents containing that term, df_t, and taking the log of that value.
b) Word embedding
In word embedding, each word is mapped to a vector of real numbers in a predefined vector space. In this work, fastText is used to obtain word embedding features [6]. This model was chosen because it has been widely used in previous studies, such as [2],[3], and because it is open source and free.
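A common way to turn per-word vectors into a fixed-size document feature is to average them; the sketch below illustrates only that averaging step. The tiny three-dimensional embedding table is invented for illustration — in practice these vectors would come from a pre-trained fastText model, and the paper does not specify its exact document-vector construction.

```python
# Toy 3-dimensional "embedding table"; a real pipeline would look these
# vectors up in a pre-trained fastText model instead.
EMB = {
    "graphics": [0.9, 0.1, 0.0],
    "card":     [0.8, 0.2, 0.1],
    "faith":    [0.0, 0.9, 0.3],
}

def doc_vector(tokens, emb, dim=3):
    """Average the word vectors of known tokens into one document feature."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

print(doc_vector(["graphics", "card", "unknown"], EMB))
```

Unknown tokens are simply skipped here; fastText itself can back off to subword (character n-gram) vectors for out-of-vocabulary words [2].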
c) Text classification
Four classifiers were selected for the classification tasks:
support-vector machines (SVM), decision trees (DT), Naive
Bayes (NB) classifiers, and artificial neural networks (ANN)
with two fully connected hidden layers.
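The forward pass of such a network can be sketched in a few lines. Everything concrete below is an assumption for illustration — the layer widths, ReLU activations, softmax output, and random weights are not taken from the paper, which does not report its ANN hyperparameters.

```python
import math
import random

random.seed(0)

def dense(x, w, b, relu=True):
    """One fully connected layer: y = act(Wx + b)."""
    y = [sum(wi * xi for wi, xi in zip(row, x)) + bi
         for row, bi in zip(w, b)]
    return [max(0.0, v) for v in y] if relu else y

def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init(n_out, n_in):
    return ([[random.uniform(-0.5, 0.5) for _ in range(n_in)]
             for _ in range(n_out)], [0.0] * n_out)

# Illustrative shape: 4 input features -> two hidden layers of 8 -> 4 classes
w1, b1 = init(8, 4)
w2, b2 = init(8, 8)
w3, b3 = init(4, 8)

def forward(x):
    h1 = dense(x, w1, b1)
    h2 = dense(h1, w2, b2)
    return softmax(dense(h2, w3, b3, relu=False))

probs = forward([0.2, 0.7, 0.1, 0.0])
# probs is a probability distribution over the four classes
```

In practice the input dimension equals the feature-vector size (the TF-IDF vocabulary or the embedding dimension), and the weights are learned by backpropagation rather than left random.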
III. RESULTS AND DISCUSSION
The experiments were conducted on four classes obtained
from the 20Newsgroup benchmark dataset [7]:
comp.graphics, alt.atheism, soc.religion.christian, and
sci.med. Training used 70% of the dataset, and testing used
the other 30%; Tables I and II show the results obtained
using macro and micro averages, respectively.
The results showed term weighting schemes’ superiority
over the pre-trained fastText model, as the complexity of
embedding models necessitates a large corpus to learn
vocabulary features, or else noisy data will lead to false
positives. Also, one of the limitations of word embedding is
its dependency on the selection of the dimension size. The
term weighting schemes made use of all available words
without defining a limited dimension. Therefore, the standard
term weighting schemes provided more informative features
compared to the embedding method.
The best performance recorded was for the ANN, with
an F-score reaching 97% in all experiments, while DT had
the worst performance among the models tested. Fig. 2
shows the overall classifiers’ performances with different
feature types. F-score computes the harmonic mean of
precision and recall.
Fig. 2. Classifiers’ performances.
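The macro and micro averages reported in Tables I and II can be computed as follows. The toy labels below are invented for illustration; note that for single-label classification, micro-averaged precision, recall, and F-score all reduce to plain accuracy, which is why each row of Table II repeats one value.

```python
def prf(y_true, y_pred, label):
    """Per-class precision, recall, and F-score."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

y_true = ["med", "med", "graphics", "atheism", "graphics", "med"]
y_pred = ["med", "graphics", "graphics", "atheism", "graphics", "med"]
labels = sorted(set(y_true))

# Macro: average the per-class F-scores with equal weight.
macro_f = sum(prf(y_true, y_pred, l)[2] for l in labels) / len(labels)

# Micro (single-label case): pool all decisions; equals accuracy.
micro_f = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Macro averaging weights every class equally, so it penalizes poor performance on small classes; micro averaging weights every document equally.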
IV. CONCLUSION AND FUTURE WORK
In this study, we conducted a comparative study of word
representation for text classification which used word
embedding and TF-IDF. The TF, TF-IDF, and word
embedding were used for feature extraction, while the ANN,
SVM, DT, and NB handled text classification. It was
concluded that using TF-IDF and ANN text classification on
the 20Newsgroup dataset greatly enhanced the text
documents’ classification based on precision, recall, and F-
score. Future work should consider a larger dataset and
utilise alternative deep learning models.
REFERENCES
[1] V. Korde, "Text Classification and Classifiers: A Survey," Int. J. Artif. Intell. Appl., vol. 3, no. 2, pp. 85–99, Mar. 2012.
[2] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching Word Vectors with Subword Information," Trans. Assoc. Comput. Linguist., vol. 5, pp. 135–146, Dec. 2017.
[3] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of Tricks for Efficient Text Classification," arXiv:1607.01759 [cs], Jul. 2016.
[4] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, vol. 1. Cambridge: Cambridge University Press, 2008.
[5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs], Jan. 2013.
[6] "fastText." [Online]. Available: https://fasttext.cc/index.html. [Accessed: 13-Apr-2019].
[7] T. Mitchell, "Twenty Newsgroups Data Set." [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups. [Accessed: 05-Mar-2018].
TABLE I. MACRO-AVERAGED RESULTS USING DIFFERENT FEATURE METHODS

Feature   SVM                          NB                           DT                           ANN
type      Precision  Recall  F-score   Precision  Recall  F-score   Precision  Recall  F-score   Precision  Recall  F-score
TF        0.7880     0.7874  0.7868    0.7880     0.8336  0.8361    0.6144     0.6103  0.6091    0.8522     0.8463  0.8483
TF-IDF    0.8542     0.8517  0.8518    0.8622     0.8266  0.8272    0.6542     0.6536  0.6528    0.9762     0.9756  0.9759
FastText  0.6137     0.6170  0.6147    0.4435     0.4288  0.4172    0.3389     0.3396  0.3392    0.6447     0.6481  0.6461
TABLE II. MICRO-AVERAGED RESULTS USING DIFFERENT FEATURE METHODS

Feature   SVM                          NB                           DT                           ANN
type      Precision  Recall  F-score   Precision  Recall  F-score   Precision  Recall  F-score   Precision  Recall  F-score
TF        0.7905     0.7905  0.7905    0.8393     0.8393  0.8393    0.6166     0.6166  0.6166    0.8510     0.8510  0.8510
TF-IDF    0.8562     0.8562  0.8562    0.8377     0.8377  0.8377    0.6613     0.6613  0.6613    0.9761     0.9761  0.9761
FastText  0.6228     0.6228  0.6228    0.4281     0.4281  0.4281    0.3404     0.3404  0.3404    0.6542     0.6542  0.6542