Content uploaded by Md. Rajib Hossain
Author content
All content in this area was uploaded by Md. Rajib Hossain on Jun 15, 2021
Content may be subject to copyright.
All content in this area was uploaded by Md. Rajib Hossain on Apr 20, 2021
Text Classification Using Convolution Neural
Networks with FastText Embedding
Md. Rajib Hossain, Mohammed Moshiul Hoque, and Iqbal H. Sarker
Department of Computer Science and Engineering, Chittagong University of
Engineering and Technology, Chittagong-4349, Bangladesh
rajcsecuet@gmail.com,moshiul 240@cuet.ac.bd∗,iqbal@cuet.ac.bd
Abstract. Text classification is of growing interest to NLP researchers
because of the abundance of text on online platforms and the emergence of
various Web 2.0 applications. Recently, text classification in
resource-constrained languages has drawn much attention due to the sharp
increase in digital resources. This paper presents a CNN-based text
classification model for Bengali, a low-resource language. The goal of
Bengali text classification is to assign a text to one of a set of
pre-defined categories based on its semantic and syntactic meaning. The
proposed system comprises four key modules: embedding model generation, text
to feature representation, training, and testing. The classification system
was trained and validated with 39,079 and 6,000 texts, respectively.
Experimental evaluation with 9,779 test texts shows an accuracy of 96.85%,
which is higher than that of the existing techniques evaluated on the same
corpus.
Keywords: Natural language processing, Text classification, Feature
representation, Convolutional neural networks, Evaluation
1 Introduction
In recent years, the amount of data stored on the World Wide Web has increased
enormously due to the effortless use of electronic gadgets with Web 2.0
applications and the availability of the Internet. Much of this data is
textual and unstructured. Such enormous amounts of unstructured text must be
organized efficiently so that sorting, manipulating, and searching tasks can
be performed quickly and easily. However, manually classifying voluminous data
into pre-defined classes demands a great deal of time, effort, and money, and
may be inaccurate or infeasible in most cases. Thus, automatic text
classification is an agile solution for processing such large amounts of text:
it significantly reduces human labour and saves time and money. Text document
classification is the task of automatically assigning to a text a class or
category chosen from a set of predetermined labels. A text classification
system may be used by security agencies to identify rumours in streamed data
or to detect spam, by daily newspapers to organize news by subject category,
by libraries to categorize papers or books, and by hospitals to categorize
patients based on their diagnoses.
Although Bengali is the 7th most widely spoken language in the world, with
about 245 million speakers, it is considered one of the low-resource languages
[1]. Developing an automatic text classification system for a low-resource
language such as Bengali is a very complicated task. The scarcity of digital
resources and the lack of benchmark corpora make the task more challenging.
Taking these constraints into consideration, we propose a CNN-based Bengali
text classification system with the FastText word embedding technique.
CNN-based methods, which perform composition over word vectors to extract
complex features, have proven to be effective classifiers and achieve
excellent performance on different text classification tasks. The major
contributions of this research are:
– Develop a corpus containing 150,000 Bengali text documents for word
embedding and 54,858 text documents to classify into 6 classes.
– Investigate optimized hyperparameters for the FastText and CNN algorithms.
– Develop a CNN-based classifier model to classify Bengali text on a
self-developed dataset.
– Evaluate the performance of the proposed CNN text classification model on
the developed dataset.
2 Related Work
There has been significant progress on text classification in English, Arabic,
Chinese, and some European languages [4], [5]. However, text classification
for the Bengali language is still at a rudimentary stage. Mikolov et al. [6]
developed a shallow neural network-based word embedding model (Word2Vec) that
captures both semantic and syntactic features. The word-word co-occurrence
based model GloVe also captures semantic and syntactic features [7]. Both
GloVe and Word2Vec can represent word-level information, but they fail to
handle sub-word details and the out-of-vocabulary problem. Bojanowski et al.
[8] developed a sub-word based embedding model (FastText) that overcomes this
limitation of the GloVe and Word2Vec feature representation techniques.
Unfortunately, the FastText feature representation technique has not been well
explored to date for low-resource languages like Bengali due to the shortage
of text corpora. Kowsari et al. [9] introduced Hierarchical Deep Learning
(HDL) based text classification with GloVe embeddings, which obtained 90.93%
accuracy.
Few studies have been conducted on text classification in low-resource
languages, including Bengali and Hindi, and these are mostly based on machine
learning techniques such as SVM, Stochastic Gradient Descent (SGD), and
Decision Tree. Pal et al. [17] developed a Hindi poem classification system
using Naive Bayes, which achieved 64% accuracy on three classes. Hossain et
al. [2] developed a Bengali text classification system using a DCNN with GloVe
feature extraction and achieved an accuracy of 94.96% for 12 categories. Karim
et al. [10] developed a convolutional LSTM based Bengali text classification
system, which achieved 92.30% accuracy for five classes. Rahman et al. [19]
developed a deep CNN based emotion classification system, which achieved a
75.57% F1-score for five emotion categories. Joshi et al. [18] developed a
CNN-based model with FastText embedding for Hindi text classification, which
obtained 92.8% accuracy on six document categories. However, most previous
approaches for low-resource languages, including Bengali, suffer from the
out-of-vocabulary problem and do not consider sub-word information, which is
essential for better classification performance. The proposed model combines
FastText embedding with a CNN classifier, which reduces these weaknesses of
existing techniques.
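To make the sub-word argument concrete, the following minimal sketch lists the character n-grams of a word in the manner described for FastText in [8], where each word is wrapped in boundary markers before n-grams are taken. The function name char_ngrams is hypothetical, the n-gram range of 3 to 7 follows Table 2, and the example words are chosen only for illustration; this is not the authors' code.

```python
def char_ngrams(word, min_n=3, max_n=7):
    """Return FastText-style character n-grams of `word`.

    The word is wrapped in '<' and '>' boundary markers, as described by
    Bojanowski et al. [8]. n-grams are taken over Unicode code points.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Two inflected forms of one Bengali stem (illustrative example words).
seen = set(char_ngrams("খেলা"))      # "game"
unseen = set(char_ngrams("খেলার"))   # "of the game"
shared = seen & unseen

print(len(seen), len(unseen), len(shared))
print(sorted(shared))
```

Because the two forms share several n-grams, a word that never occurred in the embedding corpus can still receive a vector composed from sub-word units that did occur, which is the property that word-level models such as Word2Vec and GloVe lack.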
3 Proposed Methodology
The proposed framework comprises four essential components: FastText embed-
ding model generation, text to feature representation, training, and testing. Fig.
1 depicts the overview of the CNN based text classification framework.
Fig. 1: CNN based Bengali text classification framework.
3.1 Embedding Model Generation
The FastText [8] algorithm is used to generate the embedding model from the
embedding corpus (EC). The EC consists of a set of texts,
$EC = \{t_1, t_2, t_3, \ldots, t_E\}$, where $t_i$ is the ith text,
$i \in [1, E]$, and $E$ denotes the number of embedding texts in EC. The texts
in EC are the input to FastText, which generates an embedding model (EM). The
corpus-to-single-file conversion process takes the embedding texts ($t_1$ to
$t_E$) sequentially and merges them one after another to generate a single
file as the embedding corpus. The pre-processing step removes all non-Bengali
characters, mathematical symbols, HTML tags, and non-Unicode symbols. The
FastText training algorithm takes the embedding corpus file as input and
produces EM with dimension $W \times F$, i.e. $750000 \times 200$. The symbol
$W$ ($W = 750000$) denotes the number of unique words, and $F$ ($F = 200$,
the embedding dimension listed in Table 2) indicates the feature dimension in
EM.
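The pre-processing and corpus-to-single-file steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the exact cleaning rules are not given in the paper, so the sketch simply keeps code points in the Bengali Unicode block (U+0980 to U+09FF) plus whitespace, and the function names clean_text and build_embedding_corpus are hypothetical.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")
# Assumption: "non-Bengali" means anything outside U+0980..U+09FF or whitespace.
NON_BENGALI = re.compile(r"[^\u0980-\u09FF\s]")
MULTI_SPACE = re.compile(r"\s+")


def clean_text(raw):
    """Strip HTML tags and non-Bengali symbols, then normalise whitespace."""
    text = HTML_TAG.sub(" ", raw)
    text = NON_BENGALI.sub(" ", text)
    return MULTI_SPACE.sub(" ", text).strip()


def build_embedding_corpus(texts):
    """Merge cleaned texts one after another, one text per line."""
    cleaned = (clean_text(t) for t in texts)
    return "\n".join(c for c in cleaned if c)   # drop texts that become empty


sample = ["<p>বাংলা ভাষা 2021 text</p>", "x + y = z", "খেলা &amp; রাজনীতি"]
print(build_embedding_corpus(sample))
```

Note that this rule also removes ASCII digits and Latin punctuation; whether the original system kept any of these is not stated in the paper.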
3.2 Text to Feature Representation Module
Labelled text is the input to the feature representation module during the
training phase and is first passed to the tokenization process, which splits
the input text into a word list. The FastText feature mapping process takes
both the word list and the EM as input. For each word in the word list, it
extracts a total of 200 feature values, which are arranged as one row.
Finally, the FastText feature mapping process generates a feature matrix (FM)
of size $1024 \times 200$. If a text contains more than 1024 words, the
process keeps the first 1024 words and discards the rest. If the text contains
fewer than 1024 words, zero padding is added.
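The truncation and padding rule can be illustrated with a short sketch. It assumes whitespace tokenization and uses a hypothetical toy_lookup function in place of the trained embedding model; only the 1024 × 200 shape is taken from the text.

```python
MAX_WORDS = 1024   # rows of the feature matrix (Section 3.2)
EMBED_DIM = 200    # feature values per word (Section 3.2)


def text_to_feature_matrix(text, lookup, max_words=MAX_WORDS, dim=EMBED_DIM):
    """Tokenize on whitespace, map words to vectors, truncate or zero-pad."""
    words = text.split()[:max_words]            # keep at most max_words tokens
    rows = [list(lookup(w)) for w in words]
    padding = max_words - len(rows)
    rows.extend([0.0] * dim for _ in range(padding))   # zero padding rows
    return rows


def toy_lookup(word, dim=EMBED_DIM):
    """Hypothetical stand-in for the embedding model: a constant vector."""
    return [float(len(word))] * dim


fm = text_to_feature_matrix("খেলা শুরু হয়েছে", toy_lookup)
print(len(fm), len(fm[0]))   # 1024 200
```

Every text, short or long, therefore reaches the CNN with the same shape, which is what allows a fixed-size input layer.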
3.3 Text Classification Training Module
The training module takes the FM as input and builds a classifier model. The
CNN starts with an input layer (IL), where $IL = \{I_1, I_2, I_3, \ldots, I_n\}$,
$I$ denotes an input node, and $n$ indicates the number of nodes. The input
layer is followed by a multi-kernel convolutional layer with ReLU activation.
The convolutional layer (Conv) is defined as
$C = \{C_1, C_2, C_3, \ldots, C_{p \vee q \vee r}\}$, where $C$ denotes a tensor
node and $p, q, r$ indicate the tensor dimensions. Three different Conv
operations are performed on the IL feature matrix. The first Conv kernel size
is (3,3) with a tensor size of (128,3,200), the second Conv kernel size is
(5,5) with a tensor size of (128,5,200), and the third Conv kernel size is
(7,7) with a tensor size of (128,7,200). The Conv operation is performed using
Eq. (1).
$$A_{i^{th}:} = \sum_{j=1}^{h} \left(I[j:200]\right) \otimes \left(K[i^{th}:200]\right) \tag{1}$$
Here, $A_{i^{th}:}$ denotes the output of the ith Conv operation and $h$
indicates the tensor height. The output tensor sizes of the three Conv layers
are (128,1,1022), (128,1,1020), and (128,1,1018), respectively. The ReLU
operation is applied to each output tensor. The pooling layer is defined as
$PL = \{P_1, P_2, P_3, \ldots, P_{x \vee y \vee z}\}$, where $x, y, z$ indicate
the tensor sizes. Max-pooling with kernels (1022,1), (1020,1), and (1018,1) is
applied in the pooling layer (PL). Each max-pool extracts 128 feature values.
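The tensor sizes quoted above follow from sliding a kernel of height k over 1024 rows without padding, which yields 1024 − k + 1 positions. The sketch below checks this arithmetic and runs a tiny convolution, ReLU, and max-pooling pass in plain Python. It is one plausible reading of Eq. (1), with the kernel spanning the full feature width; the toy matrix and kernel values are invented for illustration, and all function names are hypothetical.

```python
def conv_full_width(matrix, kernel):
    """Slide a (k x dim) kernel down a (rows x dim) matrix, valid positions only."""
    k = len(kernel)
    out = []
    for top in range(len(matrix) - k + 1):
        total = 0.0
        for i in range(k):
            for x, w in zip(matrix[top + i], kernel[i]):
                total += x * w
        out.append(total)
    return out


def relu(values):
    return [max(0.0, v) for v in values]


def max_over_time(values):
    """Max-pooling over the whole feature map: one value per kernel."""
    return max(values)


# Shape bookkeeping for the sizes quoted in the text.
ROWS, N_KERNELS = 1024, 128
conv_lengths = {k: ROWS - k + 1 for k in (3, 5, 7)}
print(conv_lengths)                     # {3: 1022, 5: 1020, 7: 1018}
print(N_KERNELS * len(conv_lengths))    # 384

# Tiny numeric example: 5 "words", 2 features, one kernel of height 3.
toy = [[1, 0], [0, 1], [2, -1], [-3, 0], [1, 1]]
kernel = [[1, 0], [0, 1], [1, 0]]
feature_map = relu(conv_full_width(toy, kernel))
print(feature_map, max_over_time(feature_map))   # [4.0, 0.0, 3.0] 4.0
```

With 128 kernels per kernel height and one pooled value per kernel, the three branches contribute 3 × 128 = 384 values in total.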
The Concat layer takes the outputs of the pooling layer and concatenates them
one after another, producing a dense vector of 384 dimensions. The dropout
layer takes this dense vector as input and blocks some nodes based on the
dropout value. The resulting 384-dimensional feature vector is the input to
the output layer, which generates a predicted value with a class label. The
error is determined from this predicted class value, and the kernel weights
are updated using the backpropagation technique [11]. The process continues
until convergence occurs in the training phase. In our model, we observed that
the training process converged between epochs 25 and 30, and the final
training output was saved in a hierarchical file format (.meta).
3.4 Testing Module
The testing module takes unlabelled text as input and determines its class
name. Initially, the unlabelled text passes through the feature representation
module, which generates a feature matrix (FM) of size $1024 \times 200$. This
FM is sent to the testing module along with the trained model, which produces
a score vector (S). The score is calculated using Eq. (2).
$$S_j = \frac{e^{W \times X_j}}{\sum_{i=1}^{6} e^{W \times X_i}}, \qquad j = 1, \ldots, 6 \tag{2}$$
Here, $S_j$ denotes the output score of the jth class and
$S = \{S_1, S_2, S_3, \ldots, S_6\}$. $W$ denotes the weight matrix and $X_j$
represents the feature vector. The predicted class is the one with the
maximum value in $S$.
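Eq. (2) is the softmax function over six class scores, followed by an arg-max. A minimal sketch is given below; the input values are hypothetical stand-ins for the products $W \times X_j$, the class order is an assumption, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

# Class names from Section 4.1; their order here is an assumption.
CLASSES = ["accident", "crime", "entertainment", "health", "politics", "sports"]


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, classes=CLASSES):
    """Return the class with the highest softmax score, plus all scores."""
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best], scores


# Hypothetical pre-softmax values for one text.
label, scores = predict([0.3, 2.1, -1.0, 0.0, 1.2, -0.5])
print(label)   # crime
```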
4 Experiments
The proposed text classification model was implemented on a multi-core
processor with an NVIDIA GTX 1070 GPU. The physical memory is 32 GB, and the
GPU has 8 GB of internal memory. The CNN architecture was implemented in the
TensorFlow framework with Python 3.6.
4.1 Text Corpus
Owing to the unavailability of a benchmark corpus for the Bengali language, we
developed a corpus in four main steps: data crawling, pre-processing, hand
annotation, and verification. The crawler collected data from accessible
online resources such as blogs, newspapers, and e-books. Each source text is
encoded as UTF-8 and stored in *.txt form. The unlabelled crawled data
(150,000 text documents) are used to train the word embedding model. Around
25,000 texts were discarded during the pre-processing phase, and the remaining
125,000 texts were used for hand annotation. In the hand annotation phase,
five annotators inspected each text and labelled it with one of six
categories: accident, crime, entertainment, health, politics, and sports. The
initial labels of 85,000 texts were settled by majority voting among the
annotators, whereas the remaining 40,000 texts were discarded because they
were ill-formatted. One language expert was assigned to verify the 85,000
labelled texts manually. Finally, the corpus included 54,858 verified labelled
texts based on the opinion of the expert. Table 1 shows a few characteristics
of the developed corpus.
Table 1: Statistics of the embedding and categorical corpora
Embedding attributes          Value          Categorical attributes    Value
No. of texts                  150,000        No. of classes            6
No. of sentences              287,000        No. of texts              54,858
No. of words                  166,381,093    No. of sentences          150,620
No. of unique words           1,350,049      No. of words              1,506,200
Tru. vocab. (min. count 2)    750,196        No. of unique words       560,150
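The majority-voting step of the hand annotation phase in Section 4.1 can be sketched as follows. The paper says only that initial labels were settled by majority voting among five annotators; requiring a strict majority (at least three of five) and returning None otherwise are assumptions of this sketch, and the function name majority_label is hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


votes_a = ["crime", "crime", "accident", "crime", "politics"]
votes_b = ["crime", "accident", "health", "sports", "politics"]
print(majority_label(votes_a))   # crime
print(majority_label(votes_b))   # None
```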
5 Results and Analysis
The performance of the proposed model is evaluated in two phases: the
training/validation phase and the testing phase. Loss and accuracy are
calculated in the training/validation phase. In the testing phase, precision
(Pr), recall (Rc), accuracy (Ac), and F1-measure are used as measures.
We tuned the hyper-parameters of the word embedding model and the CNN on our
developed corpus for better performance. After performing hundreds of
experiments on the developed corpus, the optimised hyper-parameter values
shown in Table 2 were found.
Table 2: Optimized hyperparameters for CNN and Embedding
Embedding hyperparameters    Value       CNN hyperparameters    Value
Embedding dimension          200         Kernel size            3, 5, 7
Model                        skipgram    No. of kernels         128
Minimum word count           2           Batch size             256
Window size                  15          Dropout                0.46
Max. n-gram                  7           Epochs                 30
Min. n-gram                  3           Loss type              Categorical cross-entropy
lr                           0.10        lr                     0.087
A total of 39,079 texts were allocated for training, 6,000 for validation, and
9,779 for testing, which together make up the 54,858 labelled texts. The
convergence of the classifier model depends on the difference between the
validation accuracy and the training accuracy. Fig. 2 shows the progress of
model convergence in terms of the number of epochs.
The training accuracy starts from 0.21, rises from epoch 1 to 17, and
converges at epoch 22 with a maximum accuracy of 1.00 (100.00%). The
validation accuracy starts from 0.89 and converges at epoch 20 with the
highest accuracy of 0.97. The training loss starts at 6.23 and becomes stable
at epoch 20, while the validation loss starts from 0.21 and also becomes
stable at epoch 20. Thus, the results reveal that the classifier model
converges at epoch 20.
(a) Accuracy vs. epochs in the training and validation phases.
(b) Loss vs. epochs in the training and validation phases.
Fig. 2: Effect of the number of epochs on training and validation accuracy/loss.
Table 3 summarises the performance of the text classifier model on the test
dataset in terms of precision, recall, and F1-score. The Health category
achieved the highest precision (98.00%), recall (98.00%), and F1-score
(98.00%), whereas the Crime category obtained the lowest precision (96.00%),
recall (95.00%), and F1-score (95.00%). The Crime category shows the lowest
accuracy because of semantic and syntactic similarity between classes: part of
the Crime class distribution overlaps with the Accident class, since both
often involve death-related texts.
Table 3: Test time classifier model performance summary.
Class names      Pr (%)    Rc (%)    F1-measure (%)    No. of test texts
Accident         96.00     97.00     97.00             1,688
Crime            96.00     95.00     95.00             1,572
Entertainment    97.00     98.00     97.00             1,644
Health           98.00     98.00     98.00             1,636
Politics         96.00     98.00     97.00             1,608
Sports           98.00     96.00     97.00             1,631
Avg./total       97.00     97.00     97.00             9,779
The confusion matrix is commonly used to explain the performance of a
classification model. Table 4 presents the confusion matrix of the classifier
model on the test dataset. The Health category has the largest share of
correctly predicted texts, with 1,603 of 1,636 texts classified correctly. On
the other hand, the Crime category has the highest number of
misclassifications (85 of 1,572 texts). Misclassification is also frequent
between the Sports and Entertainment categories (37 Sports texts were
predicted as Entertainment) due to semantic and syntactic similarities: most
sports tournaments hold opening and closing ceremonies with spectacular
events, so sports-related texts overlap with entertainment.
Table 4: Confusion matrix.
Classes          Accident    Crime    Entertainment    Health    Politics    Sports
Accident         1638        42       1                3         2           2
Crime            53          1487     1                4         26          1
Entertainment    1           2        1609             10        6           16
Health           2           2        17               1603      9           3
Politics         2           16       2                12        1574        2
Sports           2           7        37               9         16          1560
A receiver operating characteristic (ROC) curve also presents the performance
of a classification model. Fig. 3 depicts the ROC curves with the class-wise
area under the curve (AUC). An AUC value of 1.0 indicates that the model
predicts the class with 100% accuracy.
Fig. 3: ROC curves for text classification model.
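The per-class figures in Table 3 and the overall accuracy can be recomputed from the confusion matrix in Table 4. The sketch below does this in plain Python. It assumes that rows are actual classes and columns are predicted classes, which is inferred from the row sums matching the test-text counts in Table 3; the function names are hypothetical.

```python
CLASSES = ["Accident", "Crime", "Entertainment", "Health", "Politics", "Sports"]
# Assumption: rows are actual classes, columns are predicted classes (Table 4).
CM = [
    [1638, 42, 1, 3, 2, 2],
    [53, 1487, 1, 4, 26, 1],
    [1, 2, 1609, 10, 6, 16],
    [2, 2, 17, 1603, 9, 3],
    [2, 16, 2, 12, 1574, 2],
    [2, 7, 37, 9, 16, 1560],
]


def per_class_metrics(cm):
    """Return (precision, recall, F1) for each class of a square matrix."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))   # column sum
        actual = sum(cm[c])                           # row sum
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(cm):
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(map(sum, cm))


for name, (p, r, f1) in zip(CLASSES, per_class_metrics(CM)):
    print(f"{name:14s} Pr={p:.4f} Rc={r:.4f} F1={f1:.4f}")
print(f"accuracy={accuracy(CM):.4f}")   # accuracy=0.9685
```

The diagonal sums to 9,471 correct predictions out of 9,779 test texts, which reproduces the reported accuracy of 96.85%.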
5.1 Comparison with Previous Techniques
We compared the performance of the proposed system with existing techniques.
Because no benchmark dataset is available for Bengali, several methods were
implemented on our developed corpus. Table 5 shows the classification accuracy
of the different techniques on the test dataset with an embedding dimension of
200.
Table 5: Performance comparison of different approaches
Methods Accuracy (%)
TF-IDF-SVM [3] 78.00
Word2Vec-SVM [12] 84.21
GloVe-SVM [13] 85.03
FastText-SVM [14] 86.12
Word2Vec-CNN [15] 94.17
GloVe-CNN [16] 95.44
FastText-CNN (Proposed) 96.85
Methods based on SVM classifiers ([3], [12], [13], and [14]) achieved poor
accuracy due to their limited semantic feature extraction capabilities.
Word2Vec and GloVe extract semantic features well, so the Word2Vec-CNN [15]
and GloVe-CNN [16] methods perform better than the SVM classifiers. However,
Word2Vec and GloVe embeddings cannot handle sub-word information, whereas
FastText embedding can. As a result, the proposed method (FastText-CNN)
provides the highest accuracy of 96.85%, which is 2.68 percentage points
higher than Word2Vec-CNN [15] and 1.41 percentage points higher than the
GloVe-CNN method [16].
6 Conclusion
In this paper, we introduced a convolutional neural network-based model with
FastText embedding for text document classification in resource-constrained
languages. A corpus of text documents in Bengali, a low-resource language, was
developed to assess the performance of the proposed model. Different
hyper-parameters of the CNN model were tuned to achieve better classification
results. Evaluation results on the test dataset showed improved performance of
the proposed method compared to the existing techniques. More text document
classes can be included with more data. Other embedding techniques such as
ELMo and BERT can be explored for further investigation. These issues are left
for future research.
Acknowledgement
This work was supported by the University Grants Commission of Bangladesh.
References
1. Phani, S., Lahiri, S., Biswas, A.: A Supervised Learning Approach for Authorship
Attribution of Bengali Literary Texts, ACM Trans. Asian Low Resour. Lang. Inf.
Process, vol. 16(4), pp. 1-15, (2017).
2. Hossain, M.R., Hoque, M.M.: Automatic Bengali Document Categorization Based
on Deep Convolution Nets, Emerging Research in Computing, Information, Com-
munication and Applications, vol. 882. Springer, Singapore, (2019).
3. Utomo, M.R.A., Sibaroni, Y.: Text Classification of British English and American
English Using Support Vector Machine, Proc. Int. Con. on ICoICT, pp. 1-6, (2019).
4. Elnagar, A., Al-Debsi, R., Einea, O.: Arabic text classification using deep learning
models, J. of Inf. Pro. & Man., vol. 57, no. 1, January (2020).
5. Xie, J., Hou, Y., Wang, Y. et al.: Chinese text classification based on atten-
tion mechanism and feature-enhanced fusion neural network. Computing 102, pp.
683–700, (2020).
6. Mikolov, T., Chen, K., Corrado G., Dean, J.: Efficient Estimation of Word Repre-
sentations in Vector Space, Journal of CoRR, (2013).
7. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global Vectors for Word Rep-
resentation, Proc. EMNLP, pp. 1532-1543, (2014).
8. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with
Subword Information, Journal of CoRR, vol. abs/1607.04606, (2016).
9. Kowsari, K., Brown, D.E., Heidarysafa, M., et al.: Hierarchical deep learning for
text classification, 16th IEEE ICMLA, Cancun, Mexico pp. 364-371, Dec. (2017).
10. Karim, M.R., Chakravarthi, B.R., McCrae J.P., Cochez, M.: Classification
Benchmarks for Under-resourced Bengali Language based on Multichannel
Convolutional-LSTM Network, arXiv:2004.07807, (2020).
11. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Representations by Back-
Propagating Errors, Nature, vol. 323, pp. 533-536, (1986).
12. Lilleberg, J., Zhu, Y., Zhang, Y.: Support Vector Machines and Word2Vec for Text
Classification with Semantic Features, Proc. on ICCI*CC, pp. 136-140, (2015).
13. Yeh, C.L., Loni, B., Schuth, A.: Tom Jumbo-Grumbo at SemEval-2019 Task 4:
Hyperpartisan News Detection with GloVe vectors and SVM, in Proc. Int. Work.
on Semantic Evaluation, ACL, pp. 1067-1071, (2019).
14. Alghamdi N., Assiri, F.: A Comparison of FastText Implementations Using Arabic
Text Classification, Int. Sys., vol. 1038, pp. 306-311, Springer Cham, (2019).
15. Jang, B., Kim, I., Kim, JW.: Word2Vec convolutional neural networks for classifi-
cation of news articles and tweets, PLOS ONE, vol. 14(8), (2019).
16. Parwez, M.A., Abulaish, M., Jahiruddin.: Multi-Label Classification of Microblog-
ging Texts Using Convolution Neural Network, in IEEE Access, vol. 7, pp. 68678-
68691, (2019).
17. Pal, K., Patel, B.V.: Automatic Multiclass Document Classification of Hindi Poems
using Machine Learning Techniques, 2020 International Conference for Emerging
Technology (INCET), Belgaum, India, pp. 1-5, (2020).
18. Joshi R., Goel P., Joshi R.: Deep Learning for Hindi Text Classification: A
Comparison, Intelligent Human Computer Interaction (IHCI) 2019, Lecture Notes
in Computer Science, vol 11886. Springer, Cham, (2019).
19. Rahman, M., Haque R., Saurav, Z.R.: Identifying and Categorizing Opinions Ex-
pressed in Bangla Sentences using Deep Learning Technique, International Journal
of Computer Applications, vol. 176(17), pp. 13-17, April (2020).