Different Word Representation For Text
Classification: A Comparative Study
Eman Alsagour
Computer Science Department
King Saud University
Riyadh, Saudi Arabia
eman.a.alsqour@gmail.com
Lubna Alhenki
Computer Science Department
King Saud University
Riyadh, Saudi Arabia
lubna.henaki@gmail.com
Mohammed Al-Dhelaan
Computer Science Department
King Saud University
Riyadh, Saudi Arabia
mdhelaan@ksu.edu.sa
Abstract—Documents typically contain large numbers of words, and many of these can complicate the classification process and reduce its accuracy. Word representation methods have therefore been employed to address this issue. In this comparative study, we examine the effectiveness of word embedding and the TF-IDF weighting scheme by applying four classifiers and assessing classification accuracy. The comparison was evaluated on the popular 20Newsgroup text document dataset. Our experiments show that using the TF-IDF method with an ANN classifier on the 20Newsgroup dataset greatly improves the classification of text documents compared with word embedding and the other classifiers.
Index Terms—Neural network, TF-IDF, word embedding
I. INTRODUCTION
The use of data mining techniques on text documents has developed rapidly over the last few decades. The data generated in text documents is vast and too complex to be analysed and processed using traditional methods, so feature extraction has become an essential step in text data mining [1]. Recently, a number of researchers have applied various feature extraction methods with several types of machine learning classifiers [2],[3], and these methods have been shown to improve classification accuracy.
Words can be represented using Term Frequency-Inverse Document Frequency (TF-IDF) or word embedding. The TF-IDF weighting scheme estimates the importance of a keyword in a particular document (local) as well as its importance in the entire collection of relevant documents (global) [4], while word embedding maps words to vectors that capture semantic word relationships [5]. However, providing accurate feature extraction for text documents remains a challenge.
Accurate text classification systems are primarily motivated by the need to achieve the highest possible accuracy when classifying text. The main contribution of this study is to compare the effectiveness of word embedding and the classical TF-IDF weighting scheme by applying an artificial neural network classifier along with three other classifiers to assess classification accuracy on the 20Newsgroup dataset; previous studies used only classical classification and feature extraction on different text datasets.
The remainder of this paper is organised as follows: the methodology is described in Section II, the results and discussion are presented in Section III, and conclusions and future work are discussed in Section IV.
II. METHODOLOGY DESIGN
The overall system workflow, which is divided into two main phases, is illustrated in Fig. 1. The first phase covers data acquisition, and the second phase pre-processes the text documents; this second phase is the core of this work.
Fig. 1. System workflow.
A. Document Pre-Processing
The text in the corpus is first tokenized using a built-in tokenization function in Python; then punctuation, special characters, and stop words are removed. Finally, the Porter algorithm is applied for word stemming, as in the sketch below.
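As an illustration only (the paper does not name the specific libraries used), the pre-processing pipeline could be implemented with NLTK roughly as follows; the tokenizer, stop-word list, and filtering rules shown here are assumptions:

```python
# Minimal pre-processing sketch. The paper does not name the libraries used,
# so NLTK is assumed here for tokenization, stop-word removal, and Porter stemming.
# (Requires: nltk.download("punkt"); nltk.download("stopwords"))
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> list[str]:
    tokens = word_tokenize(text.lower())                          # tokenization
    tokens = [t for t in tokens if t not in string.punctuation]   # drop punctuation tokens
    tokens = [t for t in tokens if t.isalnum()]                   # drop special characters
    tokens = [t for t in tokens if t not in STOP_WORDS]           # drop stop words
    return [STEMMER.stem(t) for t in tokens]                      # Porter stemming

print(preprocess("The classifiers were trained on the pre-processed documents."))
# e.g. something like ['classifi', 'train', 'document']
```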
B. Feature Extraction
To study the discriminatory power in differentiating text
classes, tokens are converted into numerical values using
TF-IDF and word embedding.
a) TF-IDF
TF-IDF is the product of two statistical terms, as shown in Equation (1). The first is the Term Frequency (TF), and the second is the Inverse Document Frequency (IDF) [4]:

W_{t,d} = tf_{t,d} · idf_t    (1)

where W_{t,d} is the weight of feature t (i.e. a word) in document d, and tf_{t,d} and idf_t are defined in Equations (2) and (3), respectively [4]:

tf_{t,d} = (number of occurrences of term t in document d) / (total number of terms in document d)    (2)
idf_t = log(N / df_t)    (3)

where tf_{t,d} is the term frequency, which counts the number of times term t occurs in document d, and idf_t = log(N / df_t) is the inverse document frequency, obtained by dividing the total number of documents, N, by the number of documents containing the term, df_t, and taking the logarithm of that value.
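As a minimal sketch (not necessarily the authors' implementation), TF-IDF features can be computed with scikit-learn's TfidfVectorizer; note that its default idf is a smoothed variant of Equation (3):

```python
# Minimal TF-IDF sketch; the paper does not specify the implementation, so
# scikit-learn's TfidfVectorizer is assumed. By default it uses a smoothed idf,
# log((1 + N) / (1 + df_t)) + 1, which differs slightly from Eq. (3).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the graphics card renders the image",
    "the doctor prescribed a new medicine",
    "atheism and religion were debated at length",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # sparse (n_documents x n_terms) matrix of W_{t,d}

print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```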
b) Word embedding
In word embedding, each word is mapped to a vector of real numbers in a predefined vector space. In this work, fastText is used to produce the word embedding features [6]. This model was chosen because it has been widely used in previous studies, such as [2],[3], and because it is open-source and free.
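A minimal sketch of using pre-trained fastText vectors is given below; the paper does not state how document-level features were formed, so averaging the per-word vectors is an assumption:

```python
# Minimal sketch using pre-trained fastText vectors [6]. How document features
# were formed is not stated in the paper; averaging per-word vectors is an
# assumption here. The English model cc.en.300.bin is several GB in size.
import numpy as np
import fasttext
import fasttext.util

fasttext.util.download_model("en", if_exists="ignore")     # fetches cc.en.300.bin
ft = fasttext.load_model("cc.en.300.bin")                   # 300-dimensional vectors

def doc_vector(tokens: list[str]) -> np.ndarray:
    """Average the word vectors of a pre-processed document."""
    vectors = [ft.get_word_vector(t) for t in tokens]
    return np.mean(vectors, axis=0) if vectors else np.zeros(ft.get_dimension())

print(doc_vector(["graphic", "card", "render"]).shape)      # (300,)
```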
c) Text classification
Four classifiers were selected for the classification tasks:
support-vector machines (SVM), decision trees (DT), Naive
Bayes (NB) classifiers, and artificial neural networks (ANN)
with two fully connected hidden layers.
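An illustrative setup of the four classifiers with scikit-learn is sketched below; the hidden-layer sizes and other hyperparameters are assumptions, since the paper does not report them:

```python
# Illustrative setup of the four classifiers; hidden-layer sizes and other
# hyperparameters are assumptions, as the paper does not report them.
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "SVM": LinearSVC(),
    "DT": DecisionTreeClassifier(),
    # MultinomialNB expects non-negative features (TF / TF-IDF);
    # GaussianNB would be needed for fastText vectors, which can be negative.
    "NB": MultinomialNB(),
    "ANN": MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300),  # two hidden layers
}
```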
III. RESULTS AND DISCUSSION
The experiments were conducted on four classes obtained
from the 20Newsgroup benchmark dataset [7]:
comp.graphics, alt.atheism, soc.religion.christian, and
sci.med. Training used 70% of the dataset, and testing used
the other 30%; Tables I and II show the results obtained
using macro and micro averages, respectively.
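A hedged end-to-end sketch of this experimental setup (the four 20Newsgroups categories, a 70/30 split, TF-IDF features, and macro/micro-averaged scores) is shown below; the random seed and the removal of headers, footers, and quotes are assumptions:

```python
# Hedged end-to-end sketch of the experiment: four 20Newsgroups categories,
# a 70/30 train/test split, TF-IDF features, and macro/micro-averaged metrics.
# The random seed and the removal of headers/footers/quotes are assumptions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

categories = ["comp.graphics", "alt.atheism", "soc.religion.christian", "sci.med"]
data = fetch_20newsgroups(subset="all", categories=categories,
                          remove=("headers", "footers", "quotes"))

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.30, random_state=42)

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(X_train), y_train)
y_pred = clf.predict(vec.transform(X_test))

for average in ("macro", "micro"):
    p, r, f, _ = precision_recall_fscore_support(y_test, y_pred, average=average)
    print(f"{average}: precision={p:.4f} recall={r:.4f} f-score={f:.4f}")
```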
The results showed the superiority of the term weighting schemes over the pre-trained fastText model: the complexity of embedding models necessitates a large corpus to learn vocabulary features, otherwise noisy data leads to false positives. Another limitation of word embedding is its dependency on the choice of dimension size, whereas the term weighting schemes made use of all available words without imposing a fixed dimensionality. The standard term weighting schemes therefore provided more informative features than the embedding method.
The best performance recorded was for the ANN, with an F-score reaching 97% (obtained with TF-IDF features), while DT had the worst performance among the models tested. Fig. 2 shows the overall classifier performance with the different feature types. The F-score is the harmonic mean of precision and recall.
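Written out, the F-score reported in Tables I and II is the standard balanced form:

F-score = 2 · (Precision · Recall) / (Precision + Recall)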
Fig. 2. Classifiers' performance (F-score) for each feature type (TF, TF-IDF, FastText).
IV. CONCLUSION AND FUTURE WORK
In this study, we conducted a comparative study of word representations for text classification using word embedding and TF-IDF. TF, TF-IDF, and word embedding were used for feature extraction, while ANN, SVM, DT, and NB classifiers handled the text classification. We conclude that using TF-IDF with an ANN on the 20Newsgroup dataset greatly improves the classification of text documents in terms of precision, recall, and F-score. Future work should consider larger datasets and utilise alternative deep learning models.
REFERENCES
[1] V. Korde, "Text Classification and Classifiers: A Survey," Int. J. Artif. Intell. Appl., vol. 3, no. 2, pp. 85–99, Mar. 2012.
[2] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching Word Vectors with Subword Information," Trans. Assoc. Comput. Linguist., vol. 5, pp. 135–146, Dec. 2017.
[3] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of Tricks for Efficient Text Classification," arXiv:1607.01759 [cs], Jul. 2016.
[4] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, vol. 1. Cambridge: Cambridge University Press, 2008.
[5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs], Jan. 2013.
[6] "fastText." [Online]. Available: https://fasttext.cc/index.html. [Accessed: 13-Apr-2019].
[7] T. Mitchell, "Twenty Newsgroups Data Set." [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups. [Accessed: 05-Mar-2018].
TABLE I. MACRO-AVERAGED RESULTS USING DIFFERENT FEATURE METHODS.

Feature type | SVM (Precision / Recall / F-score) | NB (Precision / Recall / F-score) | DT (Precision / Recall / F-score) | ANN (Precision / Recall / F-score)
TF           | 0.7880 / 0.7874 / 0.7868           | 0.7880 / 0.8336 / 0.8361          | 0.6144 / 0.6103 / 0.6091          | 0.8522 / 0.8463 / 0.8483
TF-IDF       | 0.8542 / 0.8517 / 0.8518           | 0.8622 / 0.8266 / 0.8272          | 0.6542 / 0.6536 / 0.6528          | 0.9762 / 0.9756 / 0.9759
FastText     | 0.6137 / 0.6170 / 0.6147           | 0.4435 / 0.4288 / 0.4172          | 0.3389 / 0.3396 / 0.3392          | 0.6447 / 0.6481 / 0.6461
TABLE II. MICRO-AVERAGED RESULTS USING DIFFERENT FEATURE METHODS.

Feature type | SVM (Precision / Recall / F-score) | NB (Precision / Recall / F-score) | DT (Precision / Recall / F-score) | ANN (Precision / Recall / F-score)
TF           | 0.7905 / 0.7905 / 0.7905           | 0.8393 / 0.8393 / 0.8393          | 0.6166 / 0.6166 / 0.6166          | 0.8510 / 0.8510 / 0.8510
TF-IDF       | 0.8562 / 0.8562 / 0.8562           | 0.8377 / 0.8377 / 0.8377          | 0.6613 / 0.6613 / 0.6613          | 0.9761 / 0.9761 / 0.9761
FastText     | 0.6228 / 0.6228 / 0.6228           | 0.4281 / 0.4281 / 0.4281          | 0.3404 / 0.3404 / 0.3404          | 0.6542 / 0.6542 / 0.6542