Different Word Representation For Text
Classification: A Comparative Study
Eman Alsagour
Computer Science Department
King Saud University
Riyadh, Saudi Arabia
eman.a.alsqour@gmail.com
Lubna Alhenki
Computer Science Department
King Saud University
Riyadh, Saudi Arabia
lubna.henaki@gmail.com
Mohammed Al-Dhelaan
Computer Science Department
King Saud University
Riyadh, Saudi Arabia
mdhelaan@ksu.edu.sa
Abstract—Documents typically contain large numbers of words, and many of these can complicate the classification process and reduce its accuracy. Word representation methods have therefore been employed to address this issue. In this comparative study, we examine the effectiveness of word embedding and the TF-IDF weighting scheme by applying four classifiers and assessing classification accuracy. The comparison was evaluated on the popular 20Newsgroup text document dataset. Our experiments show that using the TF-IDF method with an ANN classifier on the 20Newsgroup dataset greatly improves the classification of text documents compared with word embedding and the other classifiers.
Index Terms—Neural network, TF-IDF, word embedding
I. INTRODUCTION
The use of data mining techniques on text documents has developed rapidly over the last few decades. The data generated in text documents is vast and too complex to be analysed and processed using traditional methods, so feature extraction has become an essential step in text data mining [1]. Recently, a number of researchers have applied various feature extraction methods with several types of machine learning classifiers [2],[3], and these methods have been shown to improve classification accuracy.
Words can be represented using Term Frequency-Inverse Document Frequency (TF-IDF) or word embedding. The TF-IDF weighting scheme estimates the importance of a keyword in a particular document (local) as well as its importance in the entire collection of relevant documents (global) [4], while word embedding maps words to vectors that capture semantic word relationships [5]. However, providing accurate feature extraction for text documents remains a challenge.
Accurate text classification systems are primarily motivated by the need to achieve the highest possible accuracy when classifying text. The main contribution of this study is to compare the effectiveness of word embedding and the classical TF-IDF weighting scheme by applying an artificial neural network classifier along with three other classifiers to assess classification accuracy on the 20Newsgroup dataset; previous studies used only classical classification and feature extraction on different text datasets.
The remainder of this paper is organised as follows: the methodology is described in Section II, the results and discussion are presented in Section III, and conclusions and future work are discussed in Section IV.
II. METHODOLOGY DESIGN
The overall system workflow, which is divided into two main phases, is illustrated in Fig. 1. The first phase covers data acquisition, and the second phase pre-processes the text documents; this second phase is the core of this work.
Fig. 1. System workflow.
A. Document Pre-Processing
The text in the corpus is first tokenized using a built-in tokenization function in Python; then punctuation, special characters, and stop words are removed. Finally, the Porter algorithm is applied for word stemming, as in the sketch below.
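As an illustration only (the paper does not name the specific libraries used), the pre-processing pipeline could be implemented with NLTK roughly as follows; the tokenizer, stop-word list, and filtering rules shown here are assumptions:

```python
# Minimal pre-processing sketch. The paper does not name the libraries used,
# so NLTK is assumed here for tokenization, stop-word removal, and Porter stemming.
# (Requires: nltk.download("punkt"); nltk.download("stopwords"))
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> list[str]:
    tokens = word_tokenize(text.lower())                          # tokenization
    tokens = [t for t in tokens if t not in string.punctuation]   # drop punctuation tokens
    tokens = [t for t in tokens if t.isalnum()]                   # drop special characters
    tokens = [t for t in tokens if t not in STOP_WORDS]           # drop stop words
    return [STEMMER.stem(t) for t in tokens]                      # Porter stemming

print(preprocess("The classifiers were trained on the pre-processed documents."))
# e.g. something like ['classifi', 'train', 'document']
```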
B. Feature Extraction
To study the discriminatory power in differentiating text
classes, tokens are converted into numerical values using
TF-IDF and word embedding.
a) TF-IDF
TF-IDF is the product of two statistical terms, as shown in Equation (1). The first is the Term Frequency (TF), and the second is the Inverse Document Frequency (IDF) [4]:

W_{t,d} = tf_{t,d} · idf_t    (1)

where W_{t,d} is the weight of feature t (i.e. a word) in document d, and tf_{t,d} and idf_t are defined in Equations (2) and (3), respectively [4]:

tf_{t,d} = (number of occurrences of term t in document d) / (total number of terms in document d)    (2)
idf_t = log(N / df_t)    (3)

where tf_{t,d} is the term frequency, which counts the number of times term t occurs in document d, and idf_t = log(N / df_t) is the inverse document frequency, obtained by dividing the total number of documents, N, by the number of documents containing the term, df_t, and taking the logarithm of that value.
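As a minimal sketch (not necessarily the authors' implementation), TF-IDF features can be computed with scikit-learn's TfidfVectorizer; note that its default idf is a smoothed variant of Equation (3):

```python
# Minimal TF-IDF sketch; the paper does not specify the implementation, so
# scikit-learn's TfidfVectorizer is assumed. By default it uses a smoothed idf,
# log((1 + N) / (1 + df_t)) + 1, which differs slightly from Eq. (3).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the graphics card renders the image",
    "the doctor prescribed a new medicine",
    "atheism and religion were debated at length",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # sparse (n_documents x n_terms) matrix of W_{t,d}

print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```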
b) Word embedding
In word embedding, each word is mapped to a vector of real numbers in a predefined vector space. In this work, fastText is used to produce the word embedding features [6]. This model was chosen because it has been widely used in previous studies, such as [2],[3], and because it is open-source and free.
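A minimal sketch of using pre-trained fastText vectors is given below; the paper does not state how document-level features were formed, so averaging the per-word vectors is an assumption:

```python
# Minimal sketch using pre-trained fastText vectors [6]. How document features
# were formed is not stated in the paper; averaging per-word vectors is an
# assumption here. The English model cc.en.300.bin is several GB in size.
import numpy as np
import fasttext
import fasttext.util

fasttext.util.download_model("en", if_exists="ignore")     # fetches cc.en.300.bin
ft = fasttext.load_model("cc.en.300.bin")                   # 300-dimensional vectors

def doc_vector(tokens: list[str]) -> np.ndarray:
    """Average the word vectors of a pre-processed document."""
    vectors = [ft.get_word_vector(t) for t in tokens]
    return np.mean(vectors, axis=0) if vectors else np.zeros(ft.get_dimension())

print(doc_vector(["graphic", "card", "render"]).shape)      # (300,)
```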
c) Text classification
Four classifiers were selected for the classification tasks:
support-vector machines (SVM), decision trees (DT), Naive
Bayes (NB) classifiers, and artificial neural networks (ANN)
with two fully connected hidden layers.
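An illustrative setup of the four classifiers with scikit-learn is sketched below; the hidden-layer sizes and other hyperparameters are assumptions, since the paper does not report them:

```python
# Illustrative setup of the four classifiers; hidden-layer sizes and other
# hyperparameters are assumptions, as the paper does not report them.
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "SVM": LinearSVC(),
    "DT": DecisionTreeClassifier(),
    # MultinomialNB expects non-negative features (TF / TF-IDF);
    # GaussianNB would be needed for fastText vectors, which can be negative.
    "NB": MultinomialNB(),
    "ANN": MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300),  # two hidden layers
}
```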
III. RESULTS AND DISCUSSION
The experiments were conducted on four classes obtained
from the 20Newsgroup benchmark dataset [7]:
comp.graphics, alt.atheism, soc.religion.christian, and
sci.med. Training used 70% of the dataset, and testing used
the other 30%; Tables I and II show the results obtained
using macro and micro averages, respectively.
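A hedged end-to-end sketch of this experimental setup (the four 20Newsgroups categories, a 70/30 split, TF-IDF features, and macro/micro-averaged scores) is shown below; the random seed and the removal of headers, footers, and quotes are assumptions:

```python
# Hedged end-to-end sketch of the experiment: four 20Newsgroups categories,
# a 70/30 train/test split, TF-IDF features, and macro/micro-averaged metrics.
# The random seed and the removal of headers/footers/quotes are assumptions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

categories = ["comp.graphics", "alt.atheism", "soc.religion.christian", "sci.med"]
data = fetch_20newsgroups(subset="all", categories=categories,
                          remove=("headers", "footers", "quotes"))

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.30, random_state=42)

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(X_train), y_train)
y_pred = clf.predict(vec.transform(X_test))

for average in ("macro", "micro"):
    p, r, f, _ = precision_recall_fscore_support(y_test, y_pred, average=average)
    print(f"{average}: precision={p:.4f} recall={r:.4f} f-score={f:.4f}")
```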
The results showed the superiority of the term weighting schemes over the pre-trained fastText model: the complexity of embedding models necessitates a large corpus to learn vocabulary features, otherwise noisy data leads to false positives. Another limitation of word embedding is its dependency on the choice of dimension size, whereas the term weighting schemes made use of all available words without imposing a fixed dimensionality. The standard term weighting schemes therefore provided more informative features than the embedding method.
The best performance recorded was for the ANN, with an F-score reaching 97% (obtained with TF-IDF features), while DT had the worst performance among the models tested. Fig. 2 shows the overall classifier performance with the different feature types. The F-score is the harmonic mean of precision and recall.
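Written out, the F-score reported in Tables I and II is the standard balanced form:

F-score = 2 · (Precision · Recall) / (Precision + Recall)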
Fig. 2. Classifiers' performance (F-score) for each feature type (TF, TF-IDF, FastText).
IV. CONCLUSION AND FUTURE WORK
In this study, we conducted a comparative study of word representations for text classification using word embedding and TF-IDF. TF, TF-IDF, and word embedding were used for feature extraction, while ANN, SVM, DT, and NB classifiers handled the text classification. We conclude that using TF-IDF with an ANN on the 20Newsgroup dataset greatly improves the classification of text documents in terms of precision, recall, and F-score. Future work should consider larger datasets and utilise alternative deep learning models.
REFERENCES
[1] V. Korde, "Text Classification and Classifiers: A Survey," Int. J. Artif. Intell. Appl., vol. 3, no. 2, pp. 85–99, Mar. 2012.
[2] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching Word Vectors with Subword Information," Trans. Assoc. Comput. Linguist., vol. 5, pp. 135–146, Dec. 2017.
[3] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of Tricks for Efficient Text Classification," arXiv:1607.01759 [cs], Jul. 2016.
[4] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, vol. 1. Cambridge: Cambridge University Press, 2008.
[5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs], Jan. 2013.
[6] "fastText." [Online]. Available: https://fasttext.cc/index.html. [Accessed: 13-Apr-2019].
[7] T. Mitchell, "Twenty Newsgroups Data Set." [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups. [Accessed: 05-Mar-2018].
TABLE I. MACRO-AVERAGED RESULTS USING DIFFERENT FEATURE METHODS.

Feature type | SVM (Precision / Recall / F-score) | NB (Precision / Recall / F-score) | DT (Precision / Recall / F-score) | ANN (Precision / Recall / F-score)
TF           | 0.7880 / 0.7874 / 0.7868           | 0.7880 / 0.8336 / 0.8361          | 0.6144 / 0.6103 / 0.6091          | 0.8522 / 0.8463 / 0.8483
TF-IDF       | 0.8542 / 0.8517 / 0.8518           | 0.8622 / 0.8266 / 0.8272          | 0.6542 / 0.6536 / 0.6528          | 0.9762 / 0.9756 / 0.9759
FastText     | 0.6137 / 0.6170 / 0.6147           | 0.4435 / 0.4288 / 0.4172          | 0.3389 / 0.3396 / 0.3392          | 0.6447 / 0.6481 / 0.6461
TABLE II. MICRO-AVERAGED RESULTS USING DIFFERENT FEATURE METHODS.

Feature type | SVM (Precision / Recall / F-score) | NB (Precision / Recall / F-score) | DT (Precision / Recall / F-score) | ANN (Precision / Recall / F-score)
TF           | 0.7905 / 0.7905 / 0.7905           | 0.8393 / 0.8393 / 0.8393          | 0.6166 / 0.6166 / 0.6166          | 0.8510 / 0.8510 / 0.8510
TF-IDF       | 0.8562 / 0.8562 / 0.8562           | 0.8377 / 0.8377 / 0.8377          | 0.6613 / 0.6613 / 0.6613          | 0.9761 / 0.9761 / 0.9761
FastText     | 0.6228 / 0.6228 / 0.6228           | 0.4281 / 0.4281 / 0.4281          | 0.3404 / 0.3404 / 0.3404          | 0.6542 / 0.6542 / 0.6542