Text Classification Using Convolution Neural Networks with FastText Embedding

April 2021

DOI:10.1007/978-3-030-73050-5_11

In book: Hybrid Intelligent Systems (pp.103-113)

Authors:

Md. Rajib Hossain

Chittagong University of Engineering & Technology

Moshiul Hoque

Chittagong University of Engineering & Technology

Iqbal H. Sarker

Edith Cowan University

Text Classification Using Convolution Neural Networks with FastText Embedding

Md. Rajib Hossain, Mohammed Moshiul Hoque, and Iqbal H. Sarker

Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chittagong-4349, Bangladesh

rajcsecuet@gmail.com, moshiul_240@cuet.ac.bd*, iqbal@cuet.ac.bd

Abstract. Text classification has attracted growing interest among NLP researchers owing to the vast availability of text on online platforms and its emergence in various Web 2.0 applications. Recently, text classification in resource-constrained languages has been attracting much attention due to the sharp increase of digital resources. This paper presents a CNN-based text classification model for Bengali, one of the low-resource languages. The goal of Bengali text classification is to assign a text to one of the pre-defined categories based on its semantic and syntactic meaning. The proposed system comprises four key modules: embedding model generation, text-to-feature representation, training, and testing. The classification system is trained and validated with 39,079 and 6,000 texts, respectively. Experimental evaluation with 9,779 test texts shows an accuracy of 96.85%, which indicates superior performance compared to the existing techniques.

Keywords: Natural language processing, Text classification, Feature representation, Convolutional neural networks, Evaluation

1 Introduction

In recent years, data storage on the World Wide Web has increased enormously due to the effortless use of electronic gadgets in Web 2.0 applications and the availability of the Internet. Much of this data is unstructured and available in textual form. These enormous amounts of unstructured text data need to be organized efficiently so that sorting, manipulating and searching tasks can be performed quickly and easily. However, manual classification of voluminous data into pre-defined classes demands huge time, enormous effort and cost, and may be inaccurate or infeasible in most cases. Thus, automatic text classification is one of the agile solutions for processing such a large amount of text data, as it significantly reduces human labour and saves time and money. Classification of text documents refers to the task of automatically assigning a class or category to a text, chosen from a set of predetermined labels. A text classification system may be utilized by security agencies to identify rumours in streamed data or to detect spam, by daily newspapers to organize news by subject category, by libraries to organize papers or books, and by hospitals to categorize patients based on their diagnosis.

Although Bengali is the 7th most widely spoken language in the world, with about 245 million speakers, it is considered one of the low-resource languages [1]. Developing an automatic text classification system for a low-resource language such as Bengali is a very complicated task. The scarcity of digital resources and the lack of benchmark corpora make the task more challenging. Taking the current constraints into consideration, we propose a CNN-based Bengali text classification system with the FastText word embedding technique. CNN-based methods, which perform composition over word vectors to extract complex features, have proven to be effective classifiers and achieve excellent performance on different text classification tasks. The major contributions of this research are:

– Develop a corpus containing 150,000 Bengali text documents for word embedding and 54,858 text documents classified into 6 classes.
– Investigate optimized hyperparameters for the FastText and CNN algorithms.
– Develop a CNN-based classifier model to classify Bengali text on a self-developed dataset.
– Evaluate the performance of the proposed CNN text classification model on the developed dataset.

2 Related Work

There has been significant progress on text classification in English, Arabic, Chinese and some European languages [4], [5]. However, text classification for the Bengali language is still at a rudimentary stage. Mikolov et al. [6] developed a shallow neural network-based word embedding model (Word2Vec), which captures both semantic and syntactic features. The word-word co-occurrence based model (GloVe) also covers semantic and syntactic features [7]. GloVe and Word2Vec can both represent word-level information, but these techniques fail to deal with sub-word details and the out-of-vocabulary problem. Bojanowski et al. [8] developed a sub-word based embedding model (FastText), which overcomes this problem of the GloVe and Word2Vec feature representation techniques. Unfortunately, the FastText feature representation technique has not been well explored to date for low-resource languages like Bengali due to the shortage of text corpora. A Hierarchical Deep Learning (HDL) based text classification with GloVe embeddings was introduced by Kowsari et al. [9], which obtained 90.93% accuracy.

Few studies have been conducted on text classification in low-resource languages including Bengali and Hindi, and they are mostly based on machine learning methods such as SVM, Stochastic Gradient Descent (SGD) and Decision Tree. Pal et al. [17] developed a Hindi poem classification system using Naive Bayes, which achieved 64% accuracy on three classes. Hossain et al. [2] developed a Bengali text classification system using DCNN with GloVe feature extraction and achieved an accuracy of 94.96% for 12 categories. Karim et al. [10] developed a convolutional-LSTM based Bengali text classification system, which achieved 92.30% accuracy for five classes. Rahman et al. [19] developed a deep CNN based emotion classification system, which achieved a 75.57% F1-score for five emotion categories. A CNN-based model with FastText embedding was developed by Joshi et al. [18] for Hindi text classification, which obtained 92.8% accuracy on six document categories. However, most of the previous approaches in low-resource languages, including Bengali, suffer from the out-of-vocabulary problem and do not consider the sub-word information that is essential for better classification performance. The proposed model combines FastText embedding with a CNN classification model, which reduces these weaknesses of existing techniques.

3 Proposed Methodology

The proposed framework comprises four essential components: FastText embedding model generation, text-to-feature representation, training, and testing. Fig. 1 depicts an overview of the CNN-based text classification framework.

Fig. 1: CNN based Bengali text classification framework.

3.1 Embedding Model Generation

The FastText [8] algorithm is used to generate the embedding model, which is initialized with the embedding corpus ($EC$). The $EC$ consists of several texts, $EC = \{t_1, t_2, t_3, \ldots, t_E\}$, where $t_i$ is the $i$th text, $i = 1, 2, 3, \ldots, E$, and $E$ denotes the number of embedding texts in $EC$. The texts in $EC$ are used as the input of FastText, which generates an embedding model ($EM$). The corpus-to-single-file conversion process takes the embedding texts ($t_1$ to $t_E$) sequentially and merges them one after another to generate a single file as the embedding corpus. The pre-processing step removes all non-Bengali alphabets, mathematical symbols, HTML tags, and non-Unicode symbols. The FastText training algorithm takes the embedding corpus file as input and produces $EM$ with dimension $(W \times F) \in (750000 \times 200)$. The symbol $W$ ($W = 750000$) denotes the number of unique words, and $F$ ($F = 200$) indicates the feature dimension in $EM$.
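The pre-processing and corpus-to-single-file steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `clean_text` and `build_embedding_corpus` are hypothetical, the Bengali Unicode block (U+0980 to U+09FF) is used as the definition of "Bengali alphabets", and keeping the danda (U+0964) is an assumption.

```python
import re
import unicodedata

# Assumption: "Bengali" means the Unicode block U+0980..U+09FF, plus the
# danda (U+0964) used as a sentence terminator. Everything else is dropped.
_HTML_TAG = re.compile(r"<[^>]+>")
_NON_BENGALI = re.compile(r"[^\u0980-\u09FF\u0964\s]")
_SPACES = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags and every character outside the Bengali block."""
    text = unicodedata.normalize("NFC", raw)
    text = _HTML_TAG.sub(" ", text)
    text = _NON_BENGALI.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def build_embedding_corpus(texts):
    """Merge the cleaned texts t_1..t_E into one newline-separated string."""
    cleaned = (clean_text(t) for t in texts)
    return "\n".join(c for c in cleaned if c)
```

The merged string can then be written to a single UTF-8 file and passed to whichever FastText trainer is in use.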

3.2 Text to Feature Representation Module

Labelled text is used as the input of the feature representation module during the training phase and is passed to the tokenization process. The tokenization process splits the input text into a word list. The FastText feature mapping process takes both the word list and the $EM$ as input. For each word in the word list, FastText feature mapping extracts a total of 200 features, and the feature values are arranged in a row. Finally, the FastText feature mapping process generates a feature matrix ($FM$) with $FM \in (1024 \times 200)$. If a text contains more than 1024 words, the process truncates it to the first 1024 words. Zero padding is added if the text consists of fewer than 1024 words.
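The truncate-or-pad rule can be illustrated with a short sketch. It assumes whitespace tokenization and a plain dictionary standing in for the embedding model; the function name `text_to_feature_matrix` is hypothetical. Mapping unknown words to a zero row is a simplification made here, since a real FastText model would compose a vector from sub-word n-grams instead.

```python
from typing import Dict, List, Sequence

MAX_WORDS = 1024  # rows of FM, as stated in Section 3.2
EMB_DIM = 200     # columns of FM, as stated in Section 3.2


def text_to_feature_matrix(
    text: str,
    embedding: Dict[str, Sequence[float]],
    max_words: int = MAX_WORDS,
    emb_dim: int = EMB_DIM,
) -> List[List[float]]:
    """Return a max_words x emb_dim matrix for one text."""
    # Assumption: words are separated by whitespace.
    words = text.split()[:max_words]  # keep only the first max_words words
    zero = [0.0] * emb_dim
    # Simplification: unknown words get a zero row in this sketch.
    rows = [list(embedding.get(w, zero)) for w in words]
    # Zero padding for texts shorter than max_words.
    rows.extend([0.0] * emb_dim for _ in range(max_words - len(rows)))
    return rows
```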

3.3 Text Classification Training Module

The training module takes $FM$ as input and builds a classifier model. The CNN starts with an input layer ($IL$), where $IL = \{I_1, I_2, I_3, \ldots, I_n\}$, $I$ denotes an input node and $n$ indicates the number of nodes. The input layer is followed by a multi-kernel convolutional and ReLU layer. The convolutional layer (Conv) is defined as $C = \{C_1, C_2, C_3, \ldots, C_{p \vee q \vee r}\}$, where $C$ denotes a tensor node and $p, q, r$ indicate the tensor dimensions. Three different Conv operations are performed on the $IL$ feature matrix. The first Conv kernel size is (3,3) with a tensor size of (128,3,200), the second Conv kernel size is (5,5) with a tensor size of (128,5,200), and the third Conv kernel size is (7,7) with a tensor size of (128,7,200). The Conv operation is performed using Eq. (1).

$$A_{i\text{th}:} = \sum_{j=1}^{j=h} (I[j:200]) \otimes (K[i\text{th}:200]) \qquad (1)$$

Here, $A_{i\text{th}:}$ denotes the $i$th Conv operation output and $h$ indicates the tensor height. The output tensor sizes of these three Conv layers are (128,1,1022), (128,1,1020), and (128,1,1018), respectively. The ReLU operation is applied to each output tensor. The pooling layer is defined as $PL = \{P_1, P_2, P_3, \ldots, P_{x \vee y \vee z}\}$, where $x, y, z$ indicate the tensor size. Max-pooling with kernels (1022,1), (1020,1) and (1018,1) is applied in the pooling layer ($PL$). Each max-pool extracts 128 feature values.

The Concat layer takes the outputs of the pooling layer and concatenates them one after another, which produces a dense vector of 384 dimensions. The dropout layer takes the dense vector as input and blocks some nodes based on the dropout value. The feature vector (384 dimensions) is used as the input of the output layer, which generates an expected value with a class label. The error value is determined from this predicted class value, and the kernel weights are updated using the backpropagation technique [11]. The process continues until convergence occurs in the training phase. In our model, we observed that the training process converged between epochs 25 and 30, and finally the training output was saved in a hierarchical file format (.meta).
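The tensor sizes quoted in this section follow from simple shape arithmetic, which the sketch below reproduces. It assumes stride-1 convolutions without padding along the word axis (an assumption that is consistent with the reported output lengths) and 128 kernels per branch; the function names are hypothetical.

```python
def conv_output_length(n_words: int, kernel_height: int) -> int:
    """Length of a stride-1, unpadded convolution along the word axis."""
    return n_words - kernel_height + 1


def branch_shapes(n_words=1024, n_kernels=128, kernel_heights=(3, 5, 7)):
    """Shapes of each Conv branch and of the max-pool that follows it."""
    shapes = []
    for h in kernel_heights:
        conv_len = conv_output_length(n_words, h)
        shapes.append({
            "kernel_height": h,
            "conv_output": (n_kernels, 1, conv_len),
            # Pooling over the whole sequence leaves one value per kernel.
            "pool_kernel": (conv_len, 1),
            "pooled_features": n_kernels,
        })
    return shapes


shapes = branch_shapes()
# Concatenating the three pooled branches gives the dense vector size.
concat_dim = sum(s["pooled_features"] for s in shapes)
```

With the default arguments, the three branches give output lengths 1022, 1020 and 1018, and `concat_dim` is 3 x 128 = 384, matching the dense vector size stated in the text.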

3.4 Testing Module

The testing module takes unlabelled text as input and determines its class name. Initially, the unlabelled text passes through the feature representation module, which generates a feature matrix ($FM$) of size $(1024 \times 200)$. This $FM$ is sent to the testing module along with the trained model, which produces a score vector ($S$). The score is calculated using Eq. (2).

$$S[1:j]_{j=1}^{j=6} = \frac{e^{W \times X_j}}{\sum_{i=1}^{i=6} e^{W \times X_i}} \qquad (2)$$

Here, $S[1:j]$ denotes the output score and $S = \{S_1, S_2, S_3, \ldots, S_6\}$. $W$ denotes the weight matrix and $X$ with a subscript represents the feature vector. The expected output is the class with the maximum value in $S$.
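Eq. (2) is the softmax function over six class scores. A minimal sketch is given below; it assumes the products $W \times X_j$ have already been computed and are passed in as a list of six numbers (called `logits` here, a name that does not appear in the paper). The class order follows the category list in Section 4.1, and subtracting the maximum is a standard numerical-stability step that does not change the result.

```python
import math
from typing import List, Sequence, Tuple

# Class order taken from the category list in Section 4.1.
CLASSES = ["accident", "crime", "entertainment", "health", "politics", "sports"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Normalized exponential scores, as in Eq. (2)."""
    m = max(logits)  # shift by the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the class with the maximum score and that score."""
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return CLASSES[best], scores[best]
```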

4 Experiments

The proposed text classification model is implemented on a multi-core processor with an NVIDIA GTX 1070 GPU. The physical memory is 32 GB, and the GPU internal memory is 8 GB. The CNN architecture is implemented in the TensorFlow framework with Python 3.6.

4.1 Text Corpus

Owing to the unavailability of a benchmark corpus in the Bengali language, we developed a corpus to serve our purpose in four main steps: data crawling, pre-processing, hand annotation, and verification. The crawler collected data from accessible online resources such as blogs, newspapers, and e-books. Each source text is encoded as UTF-8 and stored in *.txt form. The unlabelled crawled data (150,000 text documents) are used to train the word embedding model. Around 25,000 texts are discarded during the pre-processing phase, and the remaining 125,000 texts are used for hand annotation. In the hand annotation phase, five annotators inspected each text and labelled it with one of six categories: accident, crime, entertainment, health, politics and sports. The initial labels of 85,000 texts are settled by majority voting of the annotators, whereas the remaining 40,000 texts are discarded because they are ill-formatted. One language expert was assigned to verify the 85,000 labelled texts manually. Finally, the corpus included 54,858 verified labelled texts based on the opinion of the expert. Table 1 depicts a few characteristics of the developed corpus.

Table 1: Statistics of embedding and categorical corpus

| Embedding attributes | Value | Categorical attributes | Value |
|---|---|---|---|
| No. of texts | 150,000 | No. of classes | 6 |
| No. of sentences | 287,000 | No. of texts | 54,858 |
| No. of words | 166,381,093 | No. of sentences | 150,620 |
| No. of unique words | 1,350,049 | No. of words | 1,506,200 |
| Truncated vocab. (min. count 2) | 750,196 | No. of unique words | 560,150 |
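The majority-voting step of the hand annotation phase can be sketched as follows. The paper does not say how ties or texts without a clear majority were handled, so the threshold of three matching votes out of five and the function name `majority_label` are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str], min_votes: int = 3) -> Optional[str]:
    """Return the label chosen by at least min_votes annotators, else None.

    Assumption: with five annotators, three matching votes count as a majority.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None
```

A text for which `majority_label` returns `None` would need a separate decision (for example, discarding it or sending it to the expert), which this sketch leaves open.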

5 Results and Analysis

The performance of the proposed model is evaluated in two phases: the training/validation phase and the testing phase. Loss and accuracy are calculated in the training/validation phase. In the testing phase, precision (Pr), recall (Rc), accuracy (Ac), and F1-measure are used as measures.

We adjusted the hyperparameters of the word embedding model and the CNN on our developed corpus for better performance. After performing hundreds of experiments on the developed corpus, the optimized hyperparameter values were found (Table 2).

Table 2: Optimized hyperparameters for CNN and Embedding

| Embedding hyperparameters | Value | CNN hyperparameters | Value |
|---|---|---|---|
| Embedding dimension | 200 | Kernel size | 3, 5, 7 |
| Model | skipgram | No. of kernels | 128 |
| Minimum word count | 2 | Batch size | 256 |
| Window size | 15 | Dropout | 0.46 |
| Max. n-gram | 7 | Epochs | 30 |
| Min. n-gram | 3 | Loss type | Categorical cross-entropy |
| Learning rate (lr) | 0.10 | Learning rate (lr) | 0.087 |
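The minimum and maximum n-gram values in Table 2 control which character n-grams FastText uses to build sub-word vectors. The sketch below shows the extraction scheme described by Bojanowski et al. [8], in which a word is wrapped in the boundary symbols `<` and `>` before slicing. The function name `char_ngrams` is hypothetical, and the sketch slices Unicode code points, so how combining vowel signs in Bengali are grouped is an assumption of this illustration.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 7) -> List[str]:
    """Character n-grams of a word wrapped in boundary symbols < and >."""
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams
```

Because an unseen word still shares many of these n-grams with known words, a vector can be composed for it, which is how the sub-word model addresses the out-of-vocabulary problem discussed in Section 2.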

A total of 39,079 texts are allocated for training, 6,000 texts for validation and 9,779 texts for testing. The convergence of the classifier model depends on the difference between validation accuracy and training accuracy. Fig. 2 shows the progress of model convergence in terms of the number of epochs.

The training accuracy starts from 21%, increases from epoch 1 to 17 and converges at epoch 22 with a maximum accuracy of 100.00%. The validation accuracy starts from 89% and converges at epoch 20 with the highest accuracy of 97%. The training loss starts at 6.23 and stabilizes at epoch 20, while the validation loss starts from 0.21 and stabilizes at epoch 20. Thus, the results reveal that the classifier model converges at epoch 20.

Fig. 2: Effect of training and validation accuracy/loss on epoch numbers. (a) Accuracy vs. epochs in the training and validation phases. (b) Loss vs. epochs in the training and validation phases.

Table 3 illustrates the performance of the text classifier model on the test dataset in terms of precision, recall and F1-score. The Health category achieved the highest precision (98.00%), recall (98.00%) and F1-score (98.00%), whereas the Crime category obtained the lowest precision (96.00%), recall (95.00%) and F1-score (95.00%). The Crime category shows the lowest scores because of semantic and syntactic similarity between classes: part of the Crime class distribution overlaps with the Accident class, since both typically contain death-related texts.

Table 3: Test time classifier model performance summary.

| Class names | Pr (%) | Rc (%) | F1-measure (%) | No. of test texts |
|---|---|---|---|---|
| Accident | 96.00 | 97.00 | 97.00 | 1,688 |
| Crime | 96.00 | 95.00 | 95.00 | 1,572 |
| Entertainment | 97.00 | 98.00 | 97.00 | 1,644 |
| Health | 98.00 | 98.00 | 98.00 | 1,636 |
| Politics | 96.00 | 98.00 | 97.00 | 1,608 |
| Sports | 98.00 | 96.00 | 97.00 | 1,631 |
| Avg./total | 97.00 | 97.00 | 97.00 | 9,779 |

A confusion matrix is commonly used to explain the performance of a classification model. Table 4 presents the confusion matrix of the classifier model on the test dataset. The Health category has the highest proportion of correctly predicted texts, with 1,603 out of 1,636 classified correctly. On the other hand, the Crime category has the highest number of misclassifications (85 out of 1,572 texts), most of them predicted as Accident (53). A large number of misclassifications also occurs between the Sports and Entertainment categories (37 Sports texts are predicted as Entertainment) due to semantic/syntactic similarities. Most sports tournaments organise opening and closing ceremonies with fabulous events; thus, texts related to sports overlap with entertainment.

Table 4: Confusion matrix (rows: actual class, columns: predicted class).

| Classes | Accident | Crime | Entertainment | Health | Politics | Sports |
|---|---|---|---|---|---|---|
| Accident | 1638 | 42 | 1 | 3 | 2 | 2 |
| Crime | 53 | 1487 | 1 | 4 | 26 | 1 |
| Entertainment | 1 | 2 | 1609 | 10 | 6 | 16 |
| Health | 2 | 2 | 17 | 1603 | 9 | 3 |
| Politics | 2 | 16 | 2 | 12 | 1574 | 2 |
| Sports | 2 | 7 | 37 | 9 | 16 | 1560 |
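The per-class figures in Table 3 and the overall accuracy can be recomputed from the counts in Table 4. The sketch below reads rows as actual classes and columns as predicted classes, which is an inference from the fact that each row sum equals the number of test texts listed for that class in Table 3. The function names are hypothetical.

```python
LABELS = ["Accident", "Crime", "Entertainment", "Health", "Politics", "Sports"]

# Counts copied from Table 4. Assumption: rows are actual classes and
# columns are predicted classes (row sums match the test counts in Table 3).
CM = [
    [1638, 42, 1, 3, 2, 2],
    [53, 1487, 1, 4, 26, 1],
    [1, 2, 1609, 10, 6, 16],
    [2, 2, 17, 1603, 9, 3],
    [2, 16, 2, 12, 1574, 2],
    [2, 7, 37, 9, 16, 1560],
]


def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for every class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # row sum
        predicted = sum(cm[r][k] for r in range(n))  # column sum
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(cm):
    """Fraction of texts on the diagonal of the confusion matrix."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total
```

Under this reading, 9,471 of the 9,779 test texts lie on the diagonal, which gives the reported accuracy of 96.85%.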

A receiver operating characteristic (ROC) curve presents the performance of a classification model. Fig. 3 depicts the ROC curves with class-wise areas under the curve (AUC). An AUC value of 1.0 indicates that the model predicts the class 100% accurately.

Fig. 3: ROC curves for text classification model.
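For a single class treated one-vs-rest, the area under the ROC curve equals the probability that a randomly chosen positive text receives a higher score than a randomly chosen negative one (ties counted as one half). The sketch below computes the AUC from that pairwise definition. It is an illustration only: the function name `binary_auc` is hypothetical, the scores stand in for one class's softmax output, and the quadratic loop is practical only for small samples.

```python
from typing import Sequence


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC for one class versus the rest, via pairwise comparisons.

    labels: 1 for texts of the class, 0 for all other texts.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))
```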

5.1 Comparison with The Previous Techniques

We compared the performance of the proposed system with existing techniques. Due to the unavailability of a benchmark dataset in Bengali, several methods are implemented on our developed corpus. Table 5 shows the classification performance of different techniques on the test dataset in terms of accuracy with an embedding dimension of 200.

Table 5: Performance comparison of different approaches

| Methods | Accuracy (%) |
|---|---|
| TF-IDF-SVM [3] | 78.00 |
| Word2Vec-SVM [12] | 84.21 |
| GloVe-SVM [13] | 85.03 |
| FastText-SVM [14] | 86.12 |
| Word2Vec-CNN [15] | 94.17 |
| GloVe-CNN [16] | 95.44 |
| FastText-CNN (Proposed) | 96.85 |

Statistical methods such as SVM classifiers ([3], [12], [13], and [14]) achieved poor accuracy due to their lack of semantic feature extraction capability. The Word2Vec and GloVe feature extraction methods extract semantic features well; thus the Word2Vec-CNN [15] and GloVe-CNN [16] methods perform better than the SVM classifiers. The Word2Vec and GloVe embedding methods cannot handle sub-word information, whereas the FastText embedding does. As a result, the proposed method (FastText-CNN) provides the highest accuracy of 96.85%, which is 2.68 percentage points higher than Word2Vec-CNN [15] and 1.41 percentage points higher than the GloVe-CNN method [16].

6 Conclusion

In this paper, we introduce a convolutional neural network-based model with FastText embedding for text document classification in resource-constrained languages. A corpus of text documents in Bengali, a low-resource language, is developed to assess the performance of the proposed model. Different hyperparameters of the CNN model are tuned to achieve better classification results. Evaluation results on the test dataset showed improved performance of the proposed method compared to the existing techniques. More text document classes can be included with more data. Other embedding techniques such as ELMo and BERT can be explored for further investigation. These issues are left for future research.

Acknowledgement

This work was supported by the University Grants Commission of Bangladesh.

References

1. Phani, S., Lahiri, S., Biswas, A.: A Supervised Learning Approach for Authorship Attribution of Bengali Literary Texts, ACM Trans. Asian Low Resour. Lang. Inf. Process, vol. 16(4), pp. 1-15, (2017).

2. Hossain, M.R., Hoque, M.M.: Automatic Bengali Document Categorization Based on Deep Convolution Nets, Emerging Research in Computing, Information, Communication and Applications, vol. 882. Springer, Singapore, (2019).

3. Utomo, M.R.A., Sibaroni, Y.: Text Classification of British English and American English Using Support Vector Machine, Proc. Int. Con. on ICoICT, pp. 1-6, (2019).

4. Elnagar, A., Al-Debsi, R., Einea, O.: Arabic text classification using deep learning models, J. of Inf. Pro. & Man., vol. 57, no. 1, January (2020).

5. Xie, J., Hou, Y., Wang, Y. et al.: Chinese text classification based on attention mechanism and feature-enhanced fusion neural network. Computing 102, pp. 683–700, (2020).

6. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space, Journal of CoRR, (2013).

7. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global Vectors for Word Representation, Proc. EMNLP, pp. 1532-1543, (2014).

8. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information, Journal of CoRR, vol. abs/1607.04606, (2016).

9. Kowsari, K., Brown, D.E., Heidarysafa, M., et al.: Hierarchical deep learning for text classification, 16th IEEE ICMLA, Cancun, Mexico, pp. 364-371, Dec. (2017).

10. Karim, M.R., Chakravarthi, B.R., McCrae, J.P., Cochez, M.: Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network, arXiv:2004.07807, (2020).

11. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Representations by Back-Propagating Errors, Nature, vol. 323, pp. 533-536, (1986).

12. Lilleberg, J., Zhu, Y., Zhang, Y.: Support Vector Machines and Word2Vec for Text Classification with Semantic Features, Proc. on ICCI*CC, pp. 136-140, (2015).

13. Yeh, C.L., Loni, B., Schuth, A.: Tom Jumbo-Grumbo at SemEval-2019 Task 4: Hyperpartisan News Detection with GloVe vectors and SVM, in Proc. Int. Work. on Semantic Evaluation, ACL, pp. 1067-1071, (2019).

14. Alghamdi, N., Assiri, F.: A Comparison of FastText Implementations Using Arabic Text Classification, Int. Sys., vol. 1038, pp. 306-311, Springer Cham, (2019).

15. Jang, B., Kim, I., Kim, J.W.: Word2Vec convolutional neural networks for classification of news articles and tweets, PLOS ONE, vol. 14(8), (2019).

16. Parwez, M.A., Abulaish, M., Jahiruddin.: Multi-Label Classification of Microblogging Texts Using Convolution Neural Network, in IEEE Access, vol. 7, pp. 68678-68691, (2019).

17. Pal, K., Patel, B.V.: Automatic Multiclass Document Classification of Hindi Poems using Machine Learning Techniques, 2020 International Conference for Emerging Technology (INCET), Belgaum, India, pp. 1-5, (2020).

18. Joshi, R., Goel, P., Joshi, R.: Deep Learning for Hindi Text Classification: A Comparison, Intelligent Human Computer Interaction (IHCI) 2019, Lecture Notes in Computer Science, vol. 11886. Springer, Cham, (2019).

19. Rahman, M., Haque, R., Saurav, Z.R.: Identifying and Categorizing Opinions Expressed in Bangla Sentences using Deep Learning Technique, International Journal of Computer Applications, vol. 176(17), pp. 13-17, April (2020).

Text classification using embeddings: a survey

Article

Full-text available

Mar 2023
KNOWL INF SYST

Text classification results can be hindered when just the bag-of-words model is used for representing features, because it ignores word order and senses, which can vary with the context. Embeddings have recently emerged as a means to circumvent these limitations, allowing considerable performance gains. However, determining the best combinations of classification techniques and embeddings for classifying particular corpora can be challenging. This survey provides a comprehensive review of text classification approaches that employ embeddings. First, it analyzes past and recent advancements in feature representation for text classification. Then, it identifies the combinations of embedding-based feature representations and classification techniques that have provided the best performances for classifying text from distinct corpora, also providing links to the original articles, source code (when available) and data sets used in the performance evaluation. Finally, it discusses current challenges and promising directions for text classification research, such as cost-effectiveness, multi-label classification, and the potential of knowledge graphs and knowledge embeddings to enhance text classification.

Toward Embedding Hyperparameters Optimization: Analyzing Their Impacts on Deep Leaning-Based Text Classification

Chapter

Jun 2023

In the last few years, an enormous amount of unstructured text documents has been added to the World Wide Web because of the availability of electronics gadgets and increases the usability of the Internet. Using text classification, this large amount of texts are appropriately organized, searched, and manipulated by the high resource language (e.g., English). Nevertheless, till now, it is a so-called issue for low-resource languages (like Bengali). There is no usable research and has conducted on Bengali text classification owing to the lack of standard corpora, shortage of hyperparameters tuning method of text embeddings and insufficiency of embedding model evaluations system (e.g., intrinsic and extrinsic). Text classification performance depends on embedding features, and the best embedding hyperparameter settings can produce the best embedding feature. The embedding model default hyperparameters values are developed for high resource language, and these hyperparameters settings are not well performed for low-resource languages. The low-resource hyperparameters tuning is a crucial task for the text classification domain. This study investigates the influence of embedding hyperparameters on Bengali text classification. The empirical analysis concludes that an automatic embedding hyperparameter tuning (AEHT) with convolutional neural networks (CNNs) attained the maximum text classification accuracy of 95.16 and 86.41% for BARD and IndicNLP datasets.KeywordsNatural language processingLow-resource text classificationHyperparameters tuningEmbeddingFeature extraction

AraCovTexFinder: Leveraging the transformer-based language model for Arabic COVID-19 text identification

Article

Full-text available

Jan 2024
ENG APPL ARTIF INTEL

In light of the pandemic, the identification and processing of COVID-19-related text have emerged as critical research areas within the field of Natural Language Processing (NLP). With a growing reliance on online portals and social media for information exchange and interaction, a surge in online textual content, comprising disinformation, misinformation, fake news, and rumors has led to the phenomenon of an infodemic on the World Wide Web. Arabic, spoken by over 420 million people worldwide, stands as a significant low-resource language, lacking efficient tools or applications for the detection of COVID-19-related text. Additionally, the identification of COVID-19 text is an essential prerequisite task for detecting fake and toxic content associated with COVID-19. This gap hampers crucial COVID information retrieval and processing necessary for policymakers and health authorities. Addressing this issue, this paper introduces an intelligent Arabic COVID-19 text identification system named 'AraCovTexFinder,' leveraging a fine-tuned fusion-based transformer model. Recognizing the challenges posed by a scarcity of related text corpora, substantial morphological variations in the language, and a deficiency of well-tuned hyperparameters, the proposed system aims to mitigate these hurdles. To support the proposed method, two corpora are developed: an Arabic embedding corpus (AraEC) and an Arabic COVID-19 text identification corpus (AraCoV). The study evaluates the performance of six transformer-based language models (mBERT, XML-RoBERTa, mDeBERTa-V3, mDistilBERT, BERT-Arabic, and AraBERT), 12 deep learning models (combining Word2Vec, GloVe, and FastText embedding with CNN, LSTM, VDCNN, and BiLSTM), and the newly introduced model AraCovTexFinder. Through extensive evaluation, AraCovTexFinder achieves a high accuracy of 98.89 ± 0.001%, outperforming other baseline models, including transformer-based language and deep learning models. 
This research highlights the importance of specialized tools in low-resource languages to combat the infodemic relating to COVID-19, which can assist policymakers and health authorities in making informed decisions.

Classifying Bengali Newspaper Headlines with Advanced Deep Learning Models: LSTM, Bi-LSTM, and Bi-GRU Approaches

Article

Full-text available

Dec 2023

Reading newspapers is beneficial for people of all ages and the global community. The enjoyment of gathering diverse data from various sources adds to the overall experience. To enhance specificity in Bengali news headlines, recognizing the news genre becomes crucial. Recognizing the genre of the news, it is a very challenging task in Bengali Text Classification with the help of AI. A very few research works is done on Bengali News headline classification and we have done a model to provide a solution to the addressed issue. Due to the continuous change of the structure of the news headlines, we have employed a neural network adoption connection to our methodology experiment on a mixture of primary and secondary dataset. Achieving significant results, we implemented a Bengali dataset in Multi Classification using Long-Short Term Memory (LSTM), Bi- Long-Short Term Memory (Bi-LSTM), and Bi-Gated Recurrent Unit (Bi-GRU). The dataset is established by aggregating news headlines from various Bengali news portals and websites, showcasing robust categorization performance in the end product. Six categories were employed for the classification of Bengali newspaper headlines. The Bi-LSTM Model emerged with the highest training accuracy at 97.96% and the lowest validation accuracy at 77.91%. Furthermore, it demonstrated enhanced sensitivity and specificity.

CoBertTC: Covid-19 Text Classification Using Transformer-Based Language Models

Chapter

Dec 2023

Covid-19 has significantly impacted human life, decreasing face-to-face communication and causing an exponential rise in virtual interactions. Consequently, online platforms like news websites, blogs, and social media have become the primary source of information for many aspects, particularly Covid-19-related news. Nonetheless, accurately categorizing Covid-19-related text data is an ongoing research challenge during and after the pandemic. This paper introduces a Covid-19-related text classification system named CoBerTC to address this issue, which consists of three primary modules: transformer-based language model fine-tuning, transformer-based language model inference, and best-performing model selection. Six transformer-based language models are exploited for the text classification task, including mBERT, XML-RoBERTa, mDistilBERT, IndicBERT, MuRIL, and mDeBERTa-V3 on the English Covid-19 text classification corpus (ECovC). The findings reveal that XML-RoBERTa achieved the highest accuracy of 94.22% for the Covid text classification task among the six models.

Application of Quantum Recurrent Neural Network in Low Resource Language Text Classification

Article

Full-text available

Jan 2024

Text sentiment analysis is an important task in natural language processing and has always been a hot research topic. However, in low-resource regions such as South Asia, where languages like Bengali are widely used, research interest is relatively low compared to high-resource regions due to limited computational resources and the flexible word order and highly inflectional nature of the language. With the development of quantum technology, quantum machine learning models leverage the superposition property of qubits to enhance model expressiveness and achieve faster computation compared to classical systems. To promote the development of quantum machine learning in low-resource language domains, we propose a quantum-classical hybrid architecture. This architecture utilizes a pre-trained multilingual BERT model to obtain vector representations of words and combines the proposed Batched Upload Quantum Recurrent Neural Network (BUQRNN) and Parameter Non-shared Batched Upload Quantum Recurrent Neural Network (PN-BUQRNN) as feature extraction models for sentiment analysis in Bengali. Our numerical results demonstrate that the proposed BUQRNN structure achieves a maximum accuracy improvement of 0.993% in Bengali text classification tasks while reducing average model complexity by 12%. The PN-BUQRNN structure in turn surpasses the BUQRNN structure and outperforms classical architectures in certain tasks.

Exploring Hierarchical Multi-Label Text Classification Models using Attention-Based Approaches for Vietnamese language

Conference Paper

Mar 2024

Mental Health Prediction Model on Social Media Data Using CNN-BiLSTM

Article

Full-text available

Feb 2024

Social media has transformed into a global platform for expression and interaction where users can share photos, images, and videos. The rapid development and widespread use of social media afford the opportunity to analyze the construction of social life in societies and communities. As a result of alterations in lifestyle during the COVID-19 pandemic, mental health disorders increased. Mental health is a complex issue involving numerous individual, socioeconomic, and clinical variables. Natural language processing and analysis methods are required to address this complexity. The classification of mental health-related texts, which can serve as early warnings and early diagnoses, is facilitated by analytical and natural language processing techniques. In this investigation, a CNN-BiLSTM model was utilized, aided by a FastText-based word weighting method. The data set consists of texts on mental health with labels such as borderline personality disorder (BPD), anxiety, depression, bipolar, mentalillness, schizophrenia, and poison. There are 35,000 training records and 6,108 test records. The data undergo a cleansing procedure that includes lowercasing, number removal, punctuation removal, and stopword removal. Modeling with CNN-BiLSTM and FastText weighting yielded an F1-score and accuracy of 85% each. In comparison, the Bi-LSTM model achieved an F1-score and accuracy of 83% each.

Intrinsic and Extrinsic Evaluation of Sentiment-Specific Word Embeddings

Chapter

Dec 2023

Sentiment-specific word embedding model generation and evaluation are crucial for low-resource languages. This paper explores the challenges of sentiment-specific embedding model generation and evaluation for a low-resource language, namely Bengali. It examines the effectiveness of three distinct embedding techniques (Word2Vec, GloVe, and FastText) for sentiment-specific word embeddings (SSWE). This study evaluates the performance of each embedding technique using intrinsic and extrinsic evaluation methods. Results demonstrate that the GloVe-based SSWE model achieved the highest syntactic and semantic similarity accuracy, with a Pearson correlation of 61.78% and 60.23%, respectively, and a Spearman correlation of 60.88% and 60.34%, respectively. The extrinsic evaluation involved sentiment classification using various classifiers, and the highest accuracy of 92.88% was achieved using the GloVe+CNN model. Overall, this study provides insights into effective techniques for sentiment analysis in low-resource languages.

978-981-19-8032-9

Book

Jul 2023

Focuses on the research trends, challenges, and future of artificial intelligence

Identifying and Categorizing Opinions Expressed in Bangla Sentences using Deep Learning Technique

Article

Full-text available

Apr 2020

Identifying and categorizing opinions in a sentence is one of the most prominent branches of natural language processing. It uses text classification to determine the intention of the author of a text. The intention can be to express happiness, sadness, patriotism, disgust, advice, etc. Most of the research work on opinion or sentiment analysis is in the English language. The Bengali corpus is growing day by day. A large number of online news portals publish their articles in the Bengali language, and a few news portals have a comment section that allows people to express their opinions. This work uses comments on Bengali sports news published in different newspapers to train a deep learning model that can categorize a comment according to its sentiment. Comments are collected and separated based on their inherent sentiment.

Deep Learning for Hindi Text Classification: A Comparison

Chapter

Full-text available

Apr 2020

Natural Language Processing (NLP), and especially natural language text analysis, has seen great advances in recent times. The use of deep learning has revolutionized the techniques for text processing and achieved remarkable results. Different deep learning architectures like CNN, LSTM, and the very recent Transformer have been used to achieve state-of-the-art results on a variety of NLP tasks. In this work, we survey a host of deep learning architectures for text classification tasks. The work is specifically concerned with the classification of Hindi text. Research on the classification of the morphologically rich and low-resource Hindi language, written in Devanagari script, has been limited due to the absence of a large labeled corpus. In this work, we used translated versions of English datasets to evaluate models based on CNN, LSTM, and Attention. Multilingual pre-trained sentence embeddings based on BERT and LASER are also compared to evaluate their effectiveness for the Hindi language. The paper also serves as a tutorial for popular text classification techniques.

Chinese text classification based on attention mechanism and feature-enhanced fusion neural network

Article

Full-text available

Mar 2020
COMPUTING

Owing to the uneven distribution of key features in Chinese texts, key features play different roles in text recognition in Chinese text classification tasks. We propose two models for Chinese text classification: a feature-enhanced fusion model based on an attention mechanism, a long short-term memory (LSTM) network, and a convolutional neural network (CNN); and a feature-difference enhancement attention algorithm model. Through preprocessing, the Chinese text is digitized into a vector form containing certain semantic context information and fed into the embedding layer to train and test the neural network. The feature-enhanced fusion model is implemented with double-layer LSTM and CNN modules, which enhance the fusion of the text features extracted by the attention mechanism before classification. The feature-difference enhancement attention algorithm model not only adds more weight to important text features but also strengthens the differences between them and other text features. This operation can further improve the effect of important features on Chinese text recognition. The outputs of both models are classified by the softmax function. The text classification experiments are conducted on a Chinese text corpus. The experimental results show that, compared with the contrast model, the proposed algorithm can significantly improve the recognition ability of Chinese text features.

Text Classification of British English and American English Using Support Vector Machine

Conference Paper

Full-text available

Jul 2019

Word2vec convolutional neural networks for classification of news articles and tweets

Article

Full-text available

Aug 2019
PLOS ONE

Big web data from sources including online news and Twitter are good resources for investigating deep learning. However, collected news articles and tweets almost certainly contain data unnecessary for learning, and this disturbs accurate learning. This paper explores the performance of word2vec Convolutional Neural Networks (CNNs) to classify news articles and tweets into related and unrelated ones. Using two word embedding algorithms of word2vec, Continuous Bag-of-Word (CBOW) and Skip-gram, we constructed CNN with the CBOW model and CNN with the Skip-gram model. We measured the classification accuracy of CNN with CBOW, CNN with Skip-gram, and CNN without word2vec models for real news articles and tweets. The experimental results indicated that word2vec significantly improved the accuracy of the classification model. The accuracy of the CBOW model was higher and more stable when compared to that of the Skip-gram model. The CBOW model exhibited better performance on news articles, and the Skip-gram model exhibited better performance on tweets. Specifically, CNN with word2vec models was more effective on news articles when compared to that on tweets because news articles are typically more uniform when compared to tweets.

Multi-Label Classification of Microblogging Texts Using Convolution Neural Network

Article

Full-text available

May 2019

Microblogging sites contain a huge amount of textual data, and their classification is an imperative task in many applications like information filtering, user profiling, topical analysis, and content tagging. Traditional machine learning approaches mainly use bag-of-words or n-gram techniques to generate feature vectors as text representations to train classifiers, and they perform considerably well for many text information processing tasks. Since short texts like tweets contain a very limited number of words, the traditional machine learning approaches suffer from data sparsity and curse-of-dimensionality problems due to feature representation using bag-of-words or n-gram techniques. Nowadays, the use of feature vectors like word embeddings as input to neural networks for text classification and clustering has shown remarkable performance gains. In this paper, we present different neural network models for multi-label classification of microblogging data. The proposed models are based on Convolutional Neural Network (CNN) architectures, which utilize pre-trained word embeddings from generic and domain-specific textual data sources. The word embeddings are used individually and in various combinations through different channels of the CNN to predict class labels. We also present a comparative analysis of the proposed CNN models with traditional machine learning models and one of the existing CNN architectures. The proposed models are evaluated over a real Twitter dataset, and experimental results establish their efficacy in classifying microblogging texts with improved accuracy in comparison to the traditional machine learning approaches and existing CNN models.

Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network

Conference Paper

Oct 2020

Automatic Multiclass Document Classification of Hindi Poems using Machine Learning Techniques

Conference Paper

Jun 2020

Arabic text classification using deep learning models

Article

Jan 2020
INFORM PROCESS MANAG

Text classification or categorization is the process of automatically tagging a textual document with the most relevant labels or categories. When the number of labels is restricted to one, the task becomes single-label text categorization. However, the multi-label version is challenging. For the Arabic language, both tasks (especially the latter one) become more challenging in the absence of large and free Arabic rich and rational datasets. Therefore, we introduce new rich and unbiased datasets for both the single-label (SANAD) and the multi-label (NADiA) Arabic text categorization tasks. Both corpora are made freely available to the research community on Arabic computational linguistics. Further, we present an extensive comparison of several deep learning (DL) models for Arabic text categorization in order to evaluate the effectiveness of such models on SANAD and NADiA. A unique characteristic of our proposed work, when compared to existing ones, is that it does not require a pre-processing phase and is fully based on deep learning models. Besides, we studied the impact of utilizing word2vec embedding models to improve the performance of the classification tasks. Our experimental results showed solid performance of all models on the SANAD corpus, with a minimum accuracy of 91.18%, achieved by convolutional-GRU, and a top performance of 96.94%, achieved by attention-GRU. As for NADiA, attention-GRU achieved the highest overall accuracy of 88.68% for a maximum subset of 10 categories on the “Masrawy” dataset.

A Comparison of fastText Implementations Using Arabic Text Classification

Chapter

Jan 2020

The quality of word representation is crucial to obtain good results in many natural language processing tasks. Recently, many word representation models (word embeddings), such as fastText, have been developed. In this research, we compared the algorithms for the fastText implementation, Facebook’s official implementation, and Gensim’s implementation using the same pre-trained fastText model. Using multi-class classification, we evaluated these embeddings. According to the results, the Facebook implementation performed better than Gensim’s implementation, with an average accuracy of 78.22% and 56.73%, respectively, for sentence embeddings and an average accuracy of 79.43% and 57.95%, respectively, for word embeddings.
