Text Classification Using Convolution Neural
Networks with FastText Embedding
Md. Rajib Hossain, Mohammed Moshiul Hoque, and Iqbal H. Sarker
Department of Computer Science and Engineering, Chittagong University of
Engineering and Technology, Chittagong-4349, Bangladesh
rajcsecuet@gmail.com, moshiul_240@cuet.ac.bd, iqbal@cuet.ac.bd
Abstract. Text classification has attracted growing interest among NLP researchers due to the tremendous availability of text on online platforms and its emergence in various Web 2.0 applications. Recently, text classification in resource-constrained languages has been attracting much attention due to the sharp increase in digital resources. This paper presents a CNN-based text classification model for a low-resource language, Bengali. The goal of Bengali text classification is to assign a text to one of a set of pre-defined categories based on its semantic and syntactic meaning. The proposed system comprises four key modules: embedding model generation, text-to-feature representation, training, and testing. The classification system was trained and validated with 39,079 and 6,000 text documents, respectively. Experimental evaluation with 9,779 test documents shows an accuracy of 96.85%, which indicates superior performance compared to existing techniques.
Keywords: Natural language processing, text classification, feature representation, convolutional neural networks, evaluation
1 Introduction
In recent years, data storage on the World Wide Web has increased enormously due to the effortless use of electronic gadgets, Web 2.0 applications, and the availability of the Internet. Much of these data are textual and unstructured. Such enormous amounts of unstructured text data need to be organized efficiently so that sorting, manipulating, and searching tasks can be performed quickly and easily. However, manual classification of voluminous data into pre-defined classes demands a huge amount of time, effort, and money, and may be inaccurate or infeasible in most cases. Thus, automatic text classification is an agile solution for processing such large amounts of text data, significantly reducing human labour, time, and cost. Classification of text documents refers to the task of automatically assigning to a textual datum a class or category chosen from a set of predetermined labels. A text classification system may be utilized by security agencies to identify rumours in streamed data or detect spam, by daily newspapers to organize news by subject categories, by libraries to catalogue papers or books, and by hospitals to categorize patients based on their diagnoses.
Although Bengali is the 7th most widely spoken language in the world, with about 245 million people communicating in it, it is considered one of the low-resource languages [1]. Developing an automatic text classification system for low-resource languages such as Bengali is a very complicated task: the scarcity of digital resources and the deficiency of benchmark corpora make it even more challenging. Taking the current constraints into consideration, we propose a CNN-based Bengali text classification system with the FastText word embedding technique. CNN-based methods, which perform composition over word vectors to extract complex features, have been proven to be effective classifiers and achieve excellent performance on different text classification tasks. The major contributions of this research are:
- Develop a corpus containing 150,000 Bengali text documents for word embedding and 54,858 text documents to classify into 6 classes.
- Investigate optimized hyperparameters for the FastText and CNN algorithms.
- Develop a CNN-based classifier model to classify Bengali text on a self-developed dataset.
- Evaluate the performance of the proposed CNN text classification model on the developed dataset.
2 Related Work
There has been significant progress on text classification in English, Arabic, Chinese, and some European languages [4], [5]. However, the text classification problem is still at a rudimentary stage in the realm of the Bengali language. Mikolov et al. [6] developed a shallow neural network-based word embedding model (Word2Vec), which captures both semantic and syntactic features. The word-word co-occurrence based model (GloVe) also covers semantic and syntactic features [7]. GloVe and Word2Vec can both represent word information, but these techniques fail to deal with sub-word details and the out-of-vocabulary problem. Bojanowski et al. [8] developed a sub-word knowledge-based embedding model (FastText) which overcomes these problems of GloVe and Word2Vec. Unfortunately, the FastText feature representation technique has not been well explored to date for low-resource languages like Bengali due to the shortage of text corpora. A Hierarchical Deep Learning (HDL) based text classification with GloVe embeddings was introduced by Kowsari et al. [9], which obtained 90.93% accuracy.
A few studies have been conducted on text classification in low-resource languages, including Bengali and Hindi, mostly based on machine learning techniques such as SVM, Stochastic Gradient Descent (SGD), and Decision Tree. Pal et al. [17] developed a Hindi poem classification system using Naive Bayes which achieved 64% accuracy on three classes. Hossain et al. [2] developed a Bengali text classification system using a DCNN with GloVe feature extraction and achieved an accuracy of 94.96% for 12 categories. Karim et al. [10] developed a convolutional LSTM-based Bengali text classification system, which achieved 92.30% accuracy for five classes. Rahman et al. [19] developed a deep CNN-based emotion classification system which achieved a 75.57% F1-score for five emotion categories. A CNN-based model with FastText embedding was developed by Joshi et al. [18] for Hindi text classification, which obtained 92.8% accuracy on six document categories. However, most of the previous approaches in low-resource languages, including Bengali, suffered from the out-of-vocabulary problem and did not consider the sub-word information that is essential to gain better classification performance. The proposed model combines FastText embedding with a CNN classification model, which reduces the weaknesses of the existing techniques.
3 Proposed Methodology
The proposed framework comprises four essential components: FastText embedding model generation, text-to-feature representation, training, and testing. Fig. 1 depicts an overview of the CNN-based text classification framework.
Fig. 1: CNN based Bengali text classification framework.
3.1 Embedding Model Generation
The FastText [8] algorithm is used to generate the embedding model, which is initialised with the embedding corpus (EC). The EC consists of several texts, EC = {t_1, t_2, t_3, ..., t_E}, where t_i denotes the i-th text (i = 1, 2, ..., E) and E denotes the number of embedding texts in EC. The texts in EC are used as the input of FastText, which generates an embedding model (EM). The corpus-to-single-file conversion process takes the embedding texts (t_1, ..., t_E) sequentially and merges them one after another into a single file that serves as the embedding corpus. The pre-processing step removes all non-Bengali alphabets, mathematical symbols, HTML tags, and non-Unicode symbols. The FastText training algorithm takes the embedding corpus file as input and produces EM with dimension (W \times F) = (750000 \times 300), where W (W = 750000) denotes the number of unique words and F (F = 300) indicates the feature dimension of EM.
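As a concrete illustration, the following minimal sketch shows how such a FastText embedding model could be trained with gensim (version 4), using the hyperparameter values reported later in Table 2 (a 200-dimensional skip-gram model). The corpus path, the one-document-per-line layout, and the choice of the gensim API are assumptions of this sketch, not part of the original pipeline.

```python
# A minimal sketch of the embedding-model generation step, assuming the
# merged, pre-processed embedding corpus sits in one plain-text file
# ("embedding_corpus.txt", one document per line; hypothetical path).
import re
from gensim.models import FastText

def preprocess(line):
    # Keep only Bengali characters (Unicode block U+0980-U+09FF) and spaces,
    # discarding non-Bengali alphabets, digits, and stray symbols.
    return re.sub(r"[^\u0980-\u09FF\s]", " ", line)

with open("embedding_corpus.txt", encoding="utf-8") as f:
    sentences = [preprocess(line).split() for line in f]

model = FastText(
    sentences,
    vector_size=200,   # embedding dimension (Table 2)
    sg=1,              # skip-gram model
    min_count=2,       # minimum word count
    window=15,         # context window size
    min_n=3, max_n=7,  # character n-gram range for sub-word features
    alpha=0.10,        # initial learning rate
)
model.save("bengali_fasttext.model")
```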
3.2 Text to Feature Representation Module
Labelled text is used as the input of the feature representation module during the training phase and is passed to the tokenization process, which splits the input text into a word list. The FastText feature-mapping process takes both the word list and the EM as input. For each word in the word list, FastText feature mapping extracts a total of 200 features, where the feature values are arranged in rows. Finally, the FastText feature-mapping process generates a feature matrix (FM) of size (1024 \times 200). If a text contains more than 1024 words, the process keeps only the first 1024 words; zero padding is added if the text consists of fewer than 1024 words.
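This mapping can be sketched as follows; the whitespace tokenizer and the helper name are illustrative assumptions. Because FastText composes character n-grams, even out-of-vocabulary words receive a vector instead of being dropped.

```python
# A sketch of the text-to-feature-representation step: each word of a
# tokenized text is mapped to its 200-dimensional FastText vector, and the
# resulting matrix is truncated or zero-padded to a fixed 1024 x 200 shape.
import numpy as np
from gensim.models import FastText

MAX_WORDS, DIM = 1024, 200

def text_to_feature_matrix(text, model):
    words = text.split()[:MAX_WORDS]          # keep only the first 1024 words
    fm = np.zeros((MAX_WORDS, DIM), dtype=np.float32)  # zero padding by default
    for i, w in enumerate(words):
        # Sub-word n-grams let FastText return a vector even for unseen words.
        fm[i] = model.wv[w]
    return fm                                  # feature matrix FM, shape (1024, 200)

model = FastText.load("bengali_fasttext.model")
fm = text_to_feature_matrix("some tokenized input text", model)
```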
3.3 Text Classification Training Module
The training module takes FM as input and builds a classifier model. The CNN starts with an input layer (IL), where IL = {I_1, I_2, I_3, ..., I_n}, I denotes an input node, and n indicates the number of nodes. The input layer is followed by multi-kernel convolutional and ReLU layers. The convolutional layer (Conv) is defined as C = {C_1, C_2, C_3, ..., C_{p \times q \times r}}, where C denotes a tensor node and p, q, r indicate the tensor dimensions. Three different Conv operations are performed on the IL feature matrix: the first Conv kernel size is (3, 3) with a tensor size of (128, 3, 200), the second Conv kernel size is (5, 5) with a tensor size of (128, 5, 200), and the third Conv kernel size is (7, 7) with a tensor size of (128, 7, 200). The Conv operation is performed using Eq. (1):
A_i = \sum_{j=1}^{h} (I[j, :200]) \cdot (K[i, :200])    (1)

here, A_i denotes the output of the i-th Conv operation and h indicates the tensor height. The output tensor sizes of these three Conv layers are (128, 1, 1022), (128, 1, 1020), and (128, 1, 1018), respectively. A ReLU operation is applied to each output tensor. The pooling layer is defined as PL = {P_1, P_2, P_3, ..., P_{x \times y \times z}}, where x, y, z indicate the tensor size. Max-pooling with kernels (1022, 1), (1020, 1), and (1018, 1) is applied in the pooling layer (PL). Each max-pool extracts 128 feature values.
The ConCat layer takes the outputs of the pooling layers and concatenates them one after another, producing a dense vector of 384 dimensions. The dropout layer takes this dense vector as input and blocks some nodes based on the dropout value. The 384-dimensional feature vector is used as input of the output layer, which generates a predicted value with a class label. The error value is determined from this predicted class value, and the kernel weights are updated using the backpropagation technique [11]. The process continues until convergence occurs in the training phase. In our model, we observed that the training process converged between epochs 25 and 30, and the training output is finally saved in a hierarchical file format (.meta).
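A minimal Keras sketch of this architecture follows, assuming the (3,3)/(5,5)/(7,7) kernels act as 1-D convolutions of heights 3, 5, and 7 spanning the full 200-dimensional embedding width, which reproduces the reported output widths 1022, 1020, and 1018. The optimizer choice (plain SGD) is an assumption; the paper only reports lr = 0.087 and the categorical cross-entropy loss.

```python
# Three parallel Conv branches over the 1024 x 200 feature matrix, max-pooled
# and concatenated into a 384-d vector, followed by dropout and 6-way softmax.
import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(1024, 200))          # feature matrix FM
branches = []
for height, pool in [(3, 1022), (5, 1020), (7, 1018)]:
    c = layers.Conv1D(128, height, activation="relu")(inp)  # -> (pool, 128)
    p = layers.MaxPooling1D(pool_size=pool)(c)              # -> (1, 128)
    branches.append(layers.Flatten()(p))                    # -> (128,)

x = layers.Concatenate()(branches)             # ConCat layer -> 384-d vector
x = layers.Dropout(0.46)(x)                    # dropout value from Table 2
out = layers.Dense(6, activation="softmax")(x) # six class scores, Eq. (2)

model = Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.087),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_fm, train_labels, batch_size=256, epochs=30,
#           validation_data=(val_fm, val_labels))
```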
3.4 Testing Module
The testing module takes unlabeled text as input and determines the class of that input. Initially, the unlabeled text passes through the feature representation module, which generates a feature matrix (FM) of size (1024 \times 200). This FM is sent to the testing module along with the trained model, which produces a score vector (S). The score is calculated using Eq. (2):
S_j = \frac{e^{X_j}}{\sum_{i=1}^{6} e^{X_i}}, \quad j = 1, 2, \ldots, 6    (2)

here, S_j denotes the output score of the j-th class, with S = {S_1, S_2, S_3, ..., S_6}, and X denotes the logit vector computed from the feature vector and the weight matrix W. The expected output is the class with the maximum value in S.
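The testing flow can be sketched as below, reusing the hypothetical helpers from the previous sketches. Note that in the Keras model above, the final Dense layer already applies Eq. (2), so predict() returns the score vector S directly; the standalone softmax is shown only to make Eq. (2) explicit.

```python
# A sketch of the testing step: the unlabeled text is converted to a feature
# matrix, pushed through the trained model, and the argmax of the score
# vector S gives the predicted class.
import numpy as np

CLASS_NAMES = ["accident", "crime", "entertainment", "health", "politics", "sports"]

def softmax(x):
    # Eq. (2): S_j = exp(X_j) / sum_i exp(X_i); subtracting max(x) improves
    # numerical stability without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict(text, embed_model, clf_model):
    fm = text_to_feature_matrix(text, embed_model)        # (1024, 200)
    scores = clf_model.predict(fm[np.newaxis, ...])[0]    # score vector S
    return CLASS_NAMES[int(np.argmax(scores))]            # expected output
```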
4 Experiments
The proposed text classification model was implemented on a multi-core processor with an NVIDIA GTX 1070 GPU. The physical memory size is 32 GB, with a GPU internal memory of 8 GB. The CNN architecture was deployed in the TensorFlow framework using Python 3.6.
4.1 Text Corpus
Owing to the unavailability of a benchmark corpus in the Bengali language, we developed a corpus to serve our purpose through four main steps: data crawling, pre-processing, hand annotation, and verification. The crawler collected data from accessible online resources such as blogs, newspapers, and e-books. Each source text is encoded in UTF-8 and stored in .txt format. The unlabelled crawled data (150,000 text documents) are used to train the word embedding model. Around 25,000 documents were discarded during the pre-processing phase, and the remaining 125,000 documents were used for hand annotation. In the hand annotation phase, five annotators inspected each text and labelled it into one of six categories: accident, crime, entertainment, health, politics, and sports. The initial labels of 85,000 documents were settled based on majority voting among the annotators, whereas the remaining 40,000 documents were discarded due to ill-formatting. One language expert was assigned to verify the 85,000 labelled documents manually. Finally, the corpus included 54,858 verified labelled documents based on the opinion of the expert. Table 1 depicts a few characteristics of the developed corpus.
Table 1: Statistics of the embedding and categorical corpora

Embedding attributes             Value        Categorical attributes  Value
No. of texts                     150,000      No. of classes          6
No. of sentences                 287,000      No. of texts            54,858
No. of words                     166,381,093  No. of sentences        150,620
No. of unique words              1,350,049    No. of words            1,506,200
Truncated vocab. (min count 2)   750,196      No. of unique words     560,150
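For reproducibility, statistics of this kind can be derived directly from the merged corpus file. A small sketch of the embedding-side column follows; the file name and the one-document-per-line layout are assumptions carried over from the earlier sketch.

```python
# A sketch of how the embedding-corpus statistics in Table 1 could be
# computed; "embedding_corpus.txt" is an assumed path and layout.
from collections import Counter

with open("embedding_corpus.txt", encoding="utf-8") as f:
    docs = f.read().splitlines()

words = [w for doc in docs for w in doc.split()]
vocab = Counter(words)

print("No. of texts:", len(docs))
print("No. of words:", len(words))
print("No. of unique words:", len(vocab))
print("Truncated vocab. (min count 2):",
      sum(1 for c in vocab.values() if c >= 2))
```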
5 Results and Analysis
The performance of the proposed model is evaluated in two phases: a training/validation phase and a testing phase. The loss and accuracy are calculated in the training/validation phase. In the testing phase, precision (Pr), recall (Rc), accuracy (Ac), and F1-measure are used as evaluation measures.
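As an illustration, these measures could be computed with scikit-learn as sketched below; y_true and y_pred denote the gold and predicted class indices of the test texts and are assumptions of this sketch, as is the CLASS_NAMES list from the earlier testing sketch.

```python
# Test-phase measures: accuracy, per-class Pr/Rc/F1 (as in Table 3), and the
# class-by-class error counts (as in Table 4).
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=CLASS_NAMES))
print(confusion_matrix(y_true, y_pred))
```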
We adjusted the hyper-parameters of the word embedding model and the CNN on our developed corpus for better performance. After performing hundreds of experiments on the developed corpus, the optimised hyper-parameter values were found (Table 2).
Table 2: Optimized hyperparameters for the CNN and embedding models

Embedding hyperparameters  Value     CNN hyperparameters  Value
Embedding dimension        200       Kernel size          3, 5, 7
Model                      skipgram  No. of kernels       128
Minimum word count         2         Batch size           256
Window size                15        Dropout              0.46
Max. n-gram                7         Epochs               30
Min. n-gram                3         Loss type            Categorical cross-entropy
Learning rate (lr)         0.10      Learning rate (lr)   0.087
A total of 39,079 documents were allocated for training, 6,000 for validation, and 9,779 for testing. The convergence of the classifier model depends on the difference between validation accuracy and training accuracy. Fig. 2 shows the progress of model convergence in terms of the number of epochs.
Fig. 2: Effect of training and validation accuracy/loss on epoch numbers. (a) Accuracy vs. epochs in the training and validation phases; (b) loss vs. epochs in the training and validation phases.

The training accuracy starts at 0.21, rises from epoch 1 to 17, and converges at epoch 22 with a maximum accuracy of 100.00%. The validation accuracy starts at 0.89 and converges at epoch 20 with a highest value of 0.97. The training loss initialises at 6.23 and becomes stable at epoch 20, while the validation loss starts at 0.21 and also stabilises at epoch 20. Thus, the results reveal that the classifier model converges at epoch 20.
Table 3 illustrates the performance of the text classifier model on the test dataset in terms of precision, recall, and F1-score. The Health category achieved the highest precision (98.00%), recall (98.00%), and F1-score (98.00%), whereas the Crime category gained the lowest precision (96.00%), recall (95.00%), and F1-score (95.00%).
Table 3: Test-time classifier model performance summary

Class names    Pr (%)  Rc (%)  F1 (%)  No. of test texts
Accident       96.00   97.00   97.00   1,688
Crime          96.00   95.00   95.00   1,572
Entertainment  97.00   98.00   97.00   1,644
Health         98.00   98.00   98.00   1,636
Politics       96.00   98.00   97.00   1,608
Sports         98.00   96.00   97.00   1,631
Avg./total     97.00   97.00   97.00   9,779
Due to the semantic and syntactic similarity between classes, the Crime category shows the lowest performance: some of the crime class distribution overlaps with the accident class owing to the typical scenarios shared by death-related texts.
The confusion matrix is usually utilised to explain the performance of a classification model; Table 4 presents the confusion matrix of the classifier model on the test dataset. The Health category achieved the largest number of correct predictions, with 1,603 out of 1,636 texts classified correctly. On the other hand, the Crime category suffered the highest number of misclassifications (85 out of 1,572 texts).
Table 4: Confusion matrix.
Classes Accident Crime Entertainment Health Politics Sports
Accident 1638 42 1 3 2 2
Crime 53 1487 1 4 26 1
Entertainment 1 2 1609 10 6 16
Health 2 2 17 1603 9 3
Politics 2 16 2 12 1574 2
Sports 2 7 37 9 16 1560
The largest cross-category confusion occurred between the sports and entertainment categories due to their semantic/syntactic similarities: most sports tournaments organise opening and closing ceremonies with fabulous events, so texts related to sports overlap with entertainment.
A receiver operating characteristic (ROC) curve presents the performance of a classification model. Fig. 3 depicts the ROC curves with class-wise area distributions; an AUC value of 1.0 indicates that the model predicts the corresponding class with 100% accuracy.
Fig. 3: ROC curves for text classification model.
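Per-class ROC curves of this kind are typically computed in a one-vs-rest fashion; a sketch follows, where y_true and the matrix of softmax score vectors y_score are assumed to come from the testing module described above.

```python
# One-vs-rest ROC/AUC per class: binarize the gold labels, then compare each
# class's score column against its binary indicator.
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

y_bin = label_binarize(y_true, classes=list(range(6)))   # one-hot gold labels
for c, name in enumerate(CLASS_NAMES):
    fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
    print(f"{name}: AUC = {auc(fpr, tpr):.3f}")
```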
5.1 Comparison with Previous Techniques
We compared the performance of the proposed system with existing techniques. Due to the unavailability of a benchmark dataset in Bengali, several methods were implemented on our developed corpus. Table 5 shows the classification performance of the different techniques on the test dataset in terms of accuracy with an embedding dimension of 200.
Table 5: Performance comparison of different approaches
Methods Accuracy (%)
TF-IDF-SVM [3] 78.00
Word2Vec-SVM [12] 84.21
GloVe-SVM [13] 85.03
FastText-SVM [14] 86.12
Word2Vec-CNN [15] 94.17
GloVe-CNN [16] 95.44
FastText-CNN (Proposed) 96.85
Statistical methods such as SVM classifiers ([3], [12], [13], and [14]) achieved poor accuracy due to their lack of semantic feature extraction capabilities. The Word2Vec and GloVe feature extraction methods capture semantic features; thus, the Word2Vec-CNN [15] and GloVe-CNN [16] methods perform better than the SVM classifiers. However, the Word2Vec and GloVe embedding methods cannot handle sub-word information, whereas FastText embedding does. As a result, the proposed method (FastText-CNN) provides the highest accuracy of 96.85%, which is 2.68% higher than Word2Vec-CNN [15] and 1.41% higher than the GloVe-CNN method [16].
6 Conclusion
In this paper, we introduce a convolutional neural network-based model with FastText embedding for text document classification in resource-constrained languages. A corpus of text documents in a low-resource language, namely Bengali, was developed to assess the performance of the proposed model. Different hyper-parameters of the CNN model were tuned for optimisation to achieve better classification results. Evaluation results on the test dataset showed improved performance of the proposed method compared to existing techniques. More text document classes can be included with more data, and other word embedding techniques such as ELMo and BERT can be explored in further investigations. These issues are left for future research.
Acknowledgement
This work was supported by the University Grants Commission of Bangladesh.
References
1. Phani, S., Lahiri, S., Biswas, A.: A Supervised Learning Approach for Authorship
Attribution of Bengali Literary Texts, ACM Trans. Asian Low Resour. Lang. Inf.
Process, vol. 16(4), pp. 1-15, (2017).
2. Hossain, M.R., Hoque, M.M.: Automatic Bengali Document Categorization Based
on Deep Convolution Nets, Emerging Research in Computing, Information, Com-
munication and Applications, vol. 882. Springer, Singapore, (2019).
3. Utomo, M.R.A., Sibaroni, Y.: Text Classification of British English and American
English Using Support Vector Machine, Proc. Int. Con. on ICoICT, pp. 1-6, (2019).
4. Elnagar, A., Al-Debsi, R., Einea, O.: Arabic text classification using deep learning
models, J. of Inf. Pro. & Man., vol. 57, no. 1, January (2020).
5. Xie, J., Hou, Y., Wang, Y. et al.: Chinese text classification based on atten-
tion mechanism and feature-enhanced fusion neural network. Computing 102, pp.
683–700, (2020).
6. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space, Journal of CoRR, (2013).
7. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global Vectors for Word Rep-
resentation, Proc. EMNLP, pp. 1532-1543, (2014).
8. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with
Subword Information, Journal of CoRR, vol. abs/1607.04606, (2016).
9. Kowsari, K., Brown, D.E., Heidarysafa, M., et al.: Hierarchical deep learning for
text classification, 16th IEEE ICMLA, Cancun, Mexico pp. 364-371, Dec. (2017).
10. Karim, M.R., Chakravarthi, B.R., McCrae J.P., Cochez, M.: Classification
Benchmarks for Under-resourced Bengali Language based on Multichannel
Convolutional-LSTM Network, arXiv:2004.07807, (2020).
11. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Representations by Back-
Propagating Errors, Nature, vol. 323, pp. 533-536, (1986).
12. Lilleberg, J., Zhu, Y., Zhang, Y.: Support Vector Machines and Word2Vec for Text
Classification with Semantic Features, Proc. on ICCI*CC, pp. 136-140, (2015).
13. Yeh, C.L., Loni, B., Schuth, A.: Tom Jumbo-Grumbo at SemEval-2019 Task 4:
Hyperpartisan News Detection with GloVe vectors and SVM, in Proc. Int. Work.
on Semantic Evaluation, ACL, pp. 1067-1071, (2019).
14. Alghamdi N., Assiri, F.: A Comparison of FastText Implementations Using Arabic
Text Classification, Int. Sys., vol. 1038, pp. 306-311, Springer Cham, (2019).
15. Jang, B., Kim, I., Kim, JW.: Word2Vec convolutional neural networks for classifi-
cation of news articles and tweets, PLOS ONE, vol. 14(8), (2019).
16. Parwez, M.A., Abulaish, M., Jahiruddin.: Multi-Label Classification of Microblog-
ging Texts Using Convolution Neural Network, in IEEE Access, vol. 7, pp. 68678-
68691, (2019).
17. Pal, K., Patel, B.V.: Automatic Multiclass Document Classification of Hindi Poems
using Machine Learning Techniques, 2020 International Conference for Emerging
Technology (INCET), Belgaum, India, pp. 1-5, (2020).
18. Joshi, R., Goel, P., Joshi, R.: Deep Learning for Hindi Text Classification: A Comparison, Intelligent Human Computer Interaction (IHCI) 2019, Lecture Notes in Computer Science, vol. 11886. Springer, Cham, (2019).
19. Rahman, M., Haque R., Saurav, Z.R.: Identifying and Categorizing Opinions Ex-
pressed in Bangla Sentences using Deep Learning Technique, International Journal
of Computer Applications, vol. 176(17), pp. 13-17, April (2020).
The quality of word representation is crucial to obtain good results in many natural language processing tasks. Recently, many word representation models (word embeddings), such as fastText, have been developed. In this research, we compared the algorithms for the fastText implementation, Facebook’s official implementation, and Gensim’s implementation using the same pre-trained fastText model. Using multi-class classification, we evaluated these embeddings. According to the results, the Facebook implementation performed better than Gensim’s implementation, with an average accuracy of 78.22% and 56.73%, respectively, for sentence embeddings and an average accuracy of 79.43% and 57.95%, respectively, for word embeddings.