Bangla Document Classification using Character
Level Deep Learning
Md. Mahbubur Rahman
Crowd Realty
Tokyo, Japan
mahbuburrahman2111@gmail.com
Rifat Sadik
Dept. of CSE
Jahangirnagar University
Dhaka, Bangladesh
rifat.sadik.rs@gmail.com
Al Amin Biswas
Dept. of Computer Science and Engineering
Daffodil International University
Dhaka, Bangladesh
alaminbiswas.cse@gmail.com
Abstract—Over the last few decades, the availability and accessibility of Bangla documents and their content have increased rapidly due to technological advancement. Intense research needs to be performed on various Bangla documents due to the diversity of the language and its associated sentiment. Document classification is one of the fundamental problems of Natural Language Processing. To handle misclassification and to enable convenient indexing and searching of Bangla documents on the web, researchers are nowadays exploring different fields of computer science to classify Bangla documents. In this paper, Deep Learning based approaches are implemented to classify Bangla text documents. A Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network are used for the classification task. We have implemented an advanced technique that encodes the documents at the character level. Documents from three different data sources are used to validate and test the working models. The highest classification accuracy, 95.42%, is achieved on the Prothom Alo dataset using LSTM. Furthermore, we present a comparison between the two models and explain how well the classification task can be carried out using our character-level approach with higher accuracy.
Keywords—Bangla Documents, Classification, CNN, LSTM
I. INTRODUCTION
With the advancement of web technology, documents are converted to web form instead of being written on paper or in hard copy. The Internet is flooded with documents that carry vital information. Newspapers, journals, articles, and books are now accessible via the internet, and users can read them online. With this vast amount of information, documents must be categorized or indexed accurately so that users find them convenient to access. Categorization, retrieval, and searching are the major challenges in handling web documents. To deal with this problem, document categorization has become a field of potential research.
Bangla is one of the most widely used languages: over 230 million people are native speakers, and another 37 million people use it as a secondary language [1]. Internet and web technology have made it possible to establish an enormous number of Bangla online portals for news, education, health, politics, research, and many more. This allows us to acquire knowledge in the Bangla language. But it is a matter of great concern that very little research has been done on Bangla document classification. Since Bangla is a complex and ancient language in terms of orthography and morphology, users find it difficult to categorize Bangla documents on the web. As a result, Bengali users are deprived of the advantages of the internet while acquiring knowledge, even though it has become a global platform for sharing knowledge and expressing opinions.
Nowadays there are several document classification methods. Researchers have used conventional methods such as distance measurement, fuzzy inference rules, the inverse class frequency method, centroid-based classification, etc. [2]. But due to the complex nature of the Bangla language, most of these native approaches lack precision. So researchers have now turned to artificial intelligence and its subsets, Machine Learning and Deep Learning, for document classification purposes [3].
In our research work, we have focused on Deep Learning approaches. The two most popular algorithms, namely the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), are used to classify web documents into their respective categories. Since manual feature selection is not required in Deep Learning, the process becomes fast and accurate compared to traditional approaches. A robust feature extraction technique called embedding is used. This technique is implemented at the character level, which provides efficiency while training the model.
This paper is organized as follows: Section II describes the related work to identify the research gap among the existing work and is designated as the Literature Review. The methodology used to accomplish this work is presented in Section III. The experimental results with analysis are presented in Section IV. Lastly, Section V concludes this research work.
II. LITERATURE REVIEW
To propose a state-of-the-art solution that is also the best in the working context, this section plays a vital role. There are several approaches proposed by researchers incorporating different mathematical and statistical methods for document classification.
Daiki et al. [4] proposed a method where embedding was done using image-based characters, relying on a character-level CNN (CLCNN) and a wildcard training method. Encoding was based on the pictorial structure of each character, and some of the inputs were treated as wildcards during training. Japanese novels from Aozora Bunko, consisting of almost 104 novels, were used in this study. The proposed method achieved an accuracy of 86.7% in classifying documents consisting of different varieties of expressions.
multilingual Benjamin at el. [5] proposed a method that is
based on character level Convolutional Neural Network. Texts
from microblogs and other data repositories such as Wikipedia,
Tweeter posts were used to train and test the model. This
method converted stings from UTF-8 to an 8-bit sequence
and thus amplified the conventional character level encoding
techniques. Comparing with other models, the overall results
for classifying texts based on geographic location using the
proposed method were satisfying in a couple of cases. Tianshi
et al. [6] proposed a text classification approach that encoded
the text using character level Convolutional Neural Network
and used an adversarial network that received input as a
textual feature. In this study, four different corpora from
AG-news, DBPedia, 20NG, IMDB were used as datasets.
The proposed framework had achieved higher accuracy and
performed significantly on both large-scale and small datasets.
A Character-based Text Classification Network (CharTeC-Net) was proposed by Njikam et al. [7], constructed from four different building blocks used for feature extraction. Different large-scale datasets were used to train the model for English and Chinese news categorization, ontology classification, and sentiment analysis. The proposed CharTeC-Net model achieved higher accuracy on different datasets compared to other conventional methods when dealing with large-scale data. Xiang et al. [8] proposed a study for text classification modeled using a Convolutional Neural Network at the character level, and the model was applied to different datasets. Compared to other models, it was observed that the proposed model was effective in classification tasks.
To identify the category of a domain from a Bangla document, Ankita et al. [9] classified Bangla texts from different web sources, and model training was accomplished using the LIBLINEAR classification algorithm. For feature extraction, Term Frequency-Inverse Document Frequency (TF-IDF) and a dimensionality reduction technique were used. In classifying documents from different domains, this approach resulted in higher accuracy and outperformed other traditional methods. In another work, Ankita et al. [10] used two distance measurement techniques to classify Bangla documents. The dataset, consisting of news from different domains such as business, sports, state, medical, and science and technology, was collected from different Bangla news portals. For preprocessing and feature extraction, tokenization and vector space model creation were used. The proposed method showed a promising result, achieving an accuracy of 95.80% with the Euclidean distance measure. Haydar et al. [11] proposed a Deep Learning model to extract sentiment from Bangla text. The extraction process was carried out using an RNN. This proposed process achieved an accuracy of 80% while mining sentiments from Facebook posts. A Deep Learning based text classification approach was proposed by Moqsadur et al. [12], where Bengali sports news comments were categorized based on sentiment. CNN, RNN, and Multilayer Perceptron (MLP) models were used for the categorization task. Performance was measured with respect to F1 score, precision, and recall, and in every criterion CNN outperformed the others. Kabir et al. [13] applied a Stochastic Gradient Descent (SGD) classifier for Bangla document categorization, where feature mining included TF-IDF. The experiment was conducted on BDNews24 documents. The performance was measured using the F1 score, which indicated that the proposed SGD classifier achieved a score of 0.9385, higher than the other investigated methods.
III. METHODOLOGY
Most text classification tasks nowadays are based on word tokenization [14]. In this paper, we use tokenization at the character level. Fig. 1 shows the overall outline of our methodology. For each document, we first tokenize the document into characters. After tokenization, we convert the characters to sequence ids. All of the documents are then sent to the embedding layer to get a 32-dimensional embedding vector for each character via semantic analysis. On the other hand, each document class is converted to a one-hot encoding. After that, the character embedding matrix and the one-hot encoding matrix are sent to the CNN or RNN model for text classification.
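The tokenization and encoding steps above can be sketched as follows. This is a minimal illustration, not the paper's exact code: the character set, vocabulary indices, class names, and padding convention are all assumptions, with Latin characters standing in for Bangla ones.

```python
import numpy as np

def build_vocab(texts):
    """Map every distinct character to an integer id (0 is reserved for padding)."""
    chars = sorted({ch for t in texts for ch in t})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def texts_to_ids(texts, vocab, max_len):
    """Tokenize each document by characters and convert to padded id sequences."""
    out = np.zeros((len(texts), max_len), dtype=np.int64)
    for row, t in enumerate(texts):
        for col, ch in enumerate(t[:max_len]):
            out[row, col] = vocab.get(ch, 0)  # unknown characters map to the padding id
    return out

def one_hot(labels, classes):
    """Convert class labels to one-hot vectors, as done for the document categories."""
    out = np.zeros((len(labels), len(classes)))
    for i, lab in enumerate(labels):
        out[i, classes.index(lab)] = 1.0
    return out

# Toy documents; the id matrix would feed the embedding layer.
docs = ["abc", "cab"]
vocab = build_vocab(docs)                     # {'a': 1, 'b': 2, 'c': 3}
ids = texts_to_ids(docs, vocab, max_len=4)
labels = one_hot(["sports", "news"], ["news", "sports"])
```

The resulting id sequences are what the embedding layer maps to 32-dimensional vectors, while the one-hot matrix serves as the training target.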
Fig. 1: Character level Deep Learning for document classifica-
tion (Systematic approach including character level encoding
and model building using Deep Learning algorithms).
A. Convolutional Neural Network (CNN)
A Convolutional Neural Network [15] consists of three kinds of layers: convolution, pooling, and fully connected (FC) layers.

Fig. 2: Character level CNN (Layered architecture of CNN consisting of embedding layer, convolutional layers and max-pooling layers).

The convolution layer uses filters to detect and extract features. The pooling layer is applied to every feature map and reduces its spatial size; by downsampling in this way, it helps to avoid heavy computation. The size of a feature map is reduced according to formula (1):

os = (is − ps + 1) / s    (1)

where os, is, ps, and s are the output shape, input shape, pooling size, and stride value, respectively. A fully connected (FC) layer is used for the classification part. It mixes the signals of information between each input dimension and each output class so that the classification is based on the whole input.
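As a quick check of formula (1), the pooled output shape can be computed directly; the input length below is an arbitrary example, not a value from the paper.

```python
def pooled_shape(input_shape, pool_size, stride):
    """Output shape of a pooling layer, following formula (1): os = (is - ps + 1) / s."""
    return (input_shape - pool_size + 1) // stride

# A feature map of length 100 through a size-3 pooling layer as used in our
# architecture, once with stride 1 and once with stride 3:
print(pooled_shape(100, 3, 1))  # -> 98
print(pooled_shape(10, 3, 3))   # -> 2
```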
TABLE I: CONVOLUTION AND POOLING LAYERS USED IN OUR ARCHITECTURE.

Layer  Features  Kernel  Pooling  Stride
1      64        7       3        1
2      64        7       3        1
3      64        3       N/A      1
4      64        3       3        1
TABLE I and Fig. 2 show the details of the different layers used in our model. Four convolution layers are used: the first two have kernels of size 7 and the last two of size 3. There is a pooling layer of size 3 after each convolution layer except the third one. Each of the convolution layers uses a stride of size 1.
In the fully connected part of the network, there are two fully connected (dense) layers of 256 neurons each. After each fully connected layer, a dropout layer with a rate of 0.5 is added to control overfitting. Finally, an output (dense) layer with a softmax activation function is used to produce the output.
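A minimal Keras sketch of this layer stack could look as follows. The paper does not publish its code, so this is an assumption-laden reconstruction: the vocabulary size, sequence length, number of classes, activation choices, and pooling strides are illustrative, not the authors' exact settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN, NUM_CLASSES = 100, 1024, 5  # assumed values for illustration

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, 32),                # 32-dim character embedding
    layers.Conv1D(64, 7, strides=1, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(64, 7, strides=1, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(64, 3, strides=1, activation="relu"),  # no pooling after layer 3
    layers.Conv1D(64, 3, strides=1, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```

The dropout rate of 0.5 and the two 256-neuron dense layers follow the description above; everything marked as assumed would need to be set from the actual data.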
B. Long Short-Term Memory (LSTM)
Recurrent Neural Networks [16] are designed for sequential data, but plain RNNs struggle with long-distance dependencies. In our method, we used Long Short-Term Memory (LSTM), a special type of RNN that handles such dependencies efficiently [17]. There are several kinds of gates in an LSTM cell. Each gate processes the incoming data differently and updates the cell memory.
Fig. 3 illustrates the basic architecture of an LSTM cell. Here, h_{t-1} is the output of the previous LSTM cell and x_t is the new input, which is concatenated with h_{t-1}.
At first, x_t and h_{t-1} are sent to every gate. The output obtained from each gate can be expressed by formulas (2), (3), (4), and (5) [18]:
Fig. 3: LSTM cell (architecture of an LSTM cell consisting of three gates, namely the input, forget, and output gates, and activation functions).
g = tanh(x_t W_x^g + h_{t-1} W_h^g + b_g)    (2)

i = sigmoid(x_t W_x^i + h_{t-1} W_h^i + b_i)    (3)

f = sigmoid(x_t W_x^f + h_{t-1} W_h^f + b_f)    (4)

o = sigmoid(x_t W_x^o + h_{t-1} W_h^o + b_o)    (5)
where W_x is the weight for the input, W_h is the weight for the previous cell output, and b is the bias. Finally, the output of the LSTM cell is calculated by formulas (6) and (7):

s_t = g ∗ i + s_{t-1} ∗ f    (6)

h_t = tanh(s_t) ∗ o    (7)

Here, ∗ denotes element-wise multiplication.
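A single LSTM cell step following formulas (2)–(7) can be sketched in NumPy; the dimensions and random weights below are toy values for illustration, not the 128/64-neuron configuration used in our model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W_x, W_h, b):
    """One LSTM cell step. W_x, W_h, b hold the parameters of the
    g, i, f, o gates, in that order."""
    g = np.tanh(x_t @ W_x[0] + h_prev @ W_h[0] + b[0])   # (2) candidate values
    i = sigmoid(x_t @ W_x[1] + h_prev @ W_h[1] + b[1])   # (3) input gate
    f = sigmoid(x_t @ W_x[2] + h_prev @ W_h[2] + b[2])   # (4) forget gate
    o = sigmoid(x_t @ W_x[3] + h_prev @ W_h[3] + b[3])   # (5) output gate
    s_t = g * i + s_prev * f                             # (6) new cell state
    h_t = np.tanh(s_t) * o                               # (7) cell output
    return h_t, s_t

in_dim, hid = 4, 3                        # toy sizes
W_x = rng.normal(size=(4, in_dim, hid))   # one weight matrix per gate
W_h = rng.normal(size=(4, hid, hid))
b = np.zeros((4, hid))
h, s = lstm_step(rng.normal(size=in_dim), np.zeros(hid), np.zeros(hid), W_x, W_h, b)
```

Since the output is tanh(s_t) scaled by a sigmoid gate, every component of h stays strictly inside (-1, 1).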
In our method, two LSTM layers are used sequentially (Fig. 4). The first LSTM uses 128 neurons and the second uses 64. After each LSTM layer, a dropout layer with a rate of 0.5 is added to reduce overfitting. A fully connected layer is placed after the LSTM layers. Finally, a softmax function is used to produce the output.
Fig. 4: Character level LSTM (Layered architecture including embedding layer, LSTM cells, and activation functions).
IV. RESULTS AND ANALYSIS
A. Dataset
In our experiment, we have chosen three different datasets to train and validate our models: BARD [19], Prothom Alo [20], and the Open Source Bengali Corpus (OSBC) [21]. Among the three, Prothom Alo was collected by web scraping, and the other two are open-source datasets. These data repositories contain different numbers of Bangla documents. Some documents have been removed from these datasets because they contained characters that do not belong to the Bangla language. To measure the performance efficiently, we have used three different sizes of data from these datasets. The overall dataset and its splitting are presented in TABLE II. Furthermore, we have partitioned each selected dataset in a ratio of 80 to 20 for training and testing purposes.
TABLE II: OVERALL DATASET SPLITTING.

Name           BARD    Prothom Alo  OSBC
Training data  40448   103008       63036
Testing data   10112   25753        15760
Total data     50560   128761       78796
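The 80/20 split in TABLE II can be reproduced with a few lines of Python; the totals are taken directly from the table, and the integer-arithmetic rounding is an assumption about how the counts were derived.

```python
totals = {"BARD": 50560, "Prothom Alo": 128761, "OSBC": 78796}

def split_80_20(total):
    """80% training / 20% testing split, using integer arithmetic."""
    train = total * 4 // 5
    return train, total - train

for name, total in totals.items():
    train, test = split_80_20(name and total)
    print(f"{name}: train={train}, test={test}")
```

Running this yields exactly the training and testing counts shown in TABLE II.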
B. Experiments
We trained our models using Nvidia Geforce 2080 Super
GPU. During training, we used batch size 8 and 100 epochs for
every experiment. We used cross-entropy as the loss function
and adam as the optimizer with initial learning rate 0.001 to
train the models.
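For reference, the cross-entropy loss used here reduces, for a one-hot label, to the negative log-probability the model assigns to the true class. A minimal sketch with made-up probabilities:

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Categorical cross-entropy: -sum(y_true * log(y_pred))."""
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])   # one-hot label for class 1
y_pred = np.array([0.1, 0.8, 0.1])   # example softmax output of a model
loss = cross_entropy(y_true, y_pred)  # equals -log(0.8), about 0.223
```

This is the per-document quantity that the Adam optimizer minimizes, averaged over each batch of 8.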
C. Results
Accuracy and F1 score are computed to analyze the performance of the working models. They are calculated using the following formulas (8), (9), (10), and (11) [22]:

Precision = TP / (TP + FP)    (8)

Recall = TP / (TP + FN)    (9)

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (10)

F1 = (2 ∗ Precision ∗ Recall) / (Precision + Recall)    (11)

where TP, FP, TN, and FN are the true positives, false positives, true negatives, and false negatives, respectively.

Fig. 5: Accuracy and loss of CNN over epochs (a. accuracy; b. loss).
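Formulas (8)–(11) translate directly into code; the confusion counts below are made-up numbers for illustration, not values from our experiments.

```python
def metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts,
    following formulas (8)-(11)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Example confusion counts (illustrative only):
p, r, a, f1 = metrics(tp=80, fp=20, tn=90, fn=10)
print(p, a)  # 0.8 0.85
```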
Fig. 5 and Fig. 6 depict the validation accuracy and loss of the CNN and LSTM models for all the working datasets over the epochs, while TABLE III shows the models' final accuracy and F1 scores.
D. Discussions and Findings
From Fig. 5, it is observed that CNN performs well when classifying the Prothom Alo dataset.

TABLE III: EXPERIMENTAL RESULT COMPARISON OF CNN AND LSTM.

        BARD               Prothom Alo        OSBC
Models  Accuracy  F1       Accuracy  F1       Accuracy  F1
CNN     92.85%    91.69%   94.44%    91.08%   78.10%    71.59%
LSTM    92.41%    90.76%   95.42%    92.57%   82.06%    78.07%

Fig. 6: Accuracy and loss of LSTM over epochs (a. accuracy; b. loss).

The observed accuracy of CNN on the Prothom Alo dataset is higher than on the other datasets, and its validation loss on Prothom Alo is comparatively lower. Fig. 6 demonstrates the validation accuracy and loss of the LSTM models. In terms of accuracy and loss, it is also observed that, like CNN, LSTM performs well on the Prothom Alo dataset.
V. CONCLUSION
In this paper, a character-level approach is implemented to categorize Bangla text documents. Two well-known Deep Learning models, namely CNN and LSTM, are used for the classification task. Here, 80% of the data from each dataset is used for training the models and the remaining 20% is used for testing. The presented character-level encoding scheme improves the accuracy of the classification task. This scheme first tokenizes the documents at the character level and generates a sequence of character identities, on which character embedding is applied. The overall performance of the LSTM model is satisfactory on all three datasets. The highest accuracy and F1 score, 95.42% and 92.57%, are obtained on the Prothom Alo dataset using the LSTM model. In future studies, we will try to build more advanced and robust hybrid models by integrating different Deep Learning architectures to classify documents. We will also implement Bi-directional LSTM and BERT models for categorizing Bangla text documents.
REFERENCES
[1] ”A language of Bangladesh”, Last Accessed: July 29, 2020. [Online].
Available: https://www.ethnologue.com/language/ben
[2] A. Bilski, “A review of artificial intelligence algorithms in document
classification,” International Journal of Electronics and Telecommuni-
cations, vol. 57, pp. 263–270, 2011.
[3] Y. Juhn and H. Liu, “Artificial intelligence approaches using natural
language processing to advance ehr-based clinical research,” Journal of
Allergy and Clinical Immunology, vol. 145, no. 2, pp. 463–469, 2020.
[4] D. Shimada, R. Kotani, and H. Iyatomi, “Document classification
through image-based character embedding and wildcard training,” in
2016 IEEE International Conference on Big Data (Big Data). IEEE,
2016, pp. 3922–3927.
[5] B. Adams and G. McKenzie, “Crowdsourcing the character of a place:
Character-level convolutional networks for multilingual geographic text
classification,” Transactions in GIS, vol. 22, no. 2, pp. 394–408, 2018.
[6] T. Wang, L. Liu, H. Zhang, L. Zhang, and X. Chen, “Joint character-level
convolutional and generative adversarial networks for text classification,”
Complexity, vol. 2020, 2020.
[7] A. N. Samatin Njikam and H. Zhao, “Chartec-net: An efficient and
lightweight character-based convolutional network for text classifica-
tion,” Journal of Electrical and Computer Engineering, vol. 2020, 2020.
[8] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional
networks for text classification,” in Advances in neural information
processing systems, 2015, pp. 649–657.
[9] A. Dhar, N. Dash, and K. Roy, “Classification of text documents through
distance measurement: An experiment with multi-domain bangla text
documents,” in 2017 3rd International Conference on Advances in
Computing, Communication & Automation (ICACCA)(Fall). IEEE,
2017, pp. 1–6.
[10] A. Dhar, N. S. Dash, and K. Roy, “Application of tf-idf feature for
categorizing documents of online bangla web text corpus,” in Intelligent
Engineering Informatics. Springer, 2018, pp. 51–59.
[11] M. S. Haydar, M. Al Helal, and S. A. Hossain, “Sentiment extraction
from bangla text: A character level supervised recurrent neural network
approach,” in 2018 International Conference on Computer, Communica-
tion, Chemical, Material and Electronic Engineering (IC4ME2). IEEE,
2018, pp. 1–4.
[12] M. Rahman, S. Haque, and Z. R. Saurav, “Identifying and categorizing
opinions expressed in bangla sentences using deep learning technique,”
International Journal of Computer Applications, vol. 975, p. 8887.
[13] F. Kabir, S. Siddique, M. R. A. Kotwal, and M. N. Huda, “Bangla
text document categorization using stochastic gradient descent (sgd)
classifier,” in 2015 International Conference on Cognitive Computing
and Information Processing (CCIP). IEEE, 2015, pp. 1–4.
[14] J. L. Fagan, M. D. Gunther, P. D. Over, G. Passon, C. C. Tsao,
A. Zamora, and E. M. Zamora, “Method for language-independent text
tokenization using a character categorization.” Google Patents, Feb. 5
1991, uS Patent 4,991,094.
[15] Y. Li, Z. Hao, and H. Lei, “Survey of convolutional neural network,”
Journal of Computer Applications, vol. 36, no. 9, pp. 2508–2515, 2016.
[16] G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of recurrent-
neural-network architectures and learning methods for spoken language
understanding,” in Interspeech, 2013, pp. 3771–3775.
[17] A. Sherstinsky, “Fundamentals of recurrent neural network (rnn) and
long short-term memory (lstm) network,” Physica D: Nonlinear Phe-
nomena, vol. 404, p. 132306, 2020.
[18] S. Ghosh, O. Vinyals, B. Strope, S. Roy, T. Dean, and L. Heck,
“Contextual lstm (clstm) models for large scale nlp tasks,” arXiv preprint
arXiv:1602.06291, 2016.
[19] M. T. Alam and M. M. Islam, “Bard: Bangla article classification using
a new comprehensive dataset,” in 2018 International Conference on
Bangla Speech and Language Processing (ICBSLP). IEEE, 2018, pp.
1–5.
[20] ”Prothom Alo”, Last Accessed: August 14, 2020. [Online]. Available:
https://www.prothomalo.com/
[21] ”Bangla Dataset (Corpus)”, Last Accessed: August 14, 2020. [Online].
Available: https://scdnlab.com/corpus/
[22] N. D. Marom, L. Rokach, and A. Shmilovici, “Using the confusion
matrix for improving ensemble classifiers,” in 2010 IEEE 26-th Con-
vention of Electrical and Electronics Engineers in Israel. IEEE, 2010,
pp. 000555–000559.