Bangla Document Classification using Character
Level Deep Learning
Md. Mahbubur Rahman
Crowd Realty
Tokyo, Japan
mahbuburrahman2111@gmail.com
Rifat Sadik
Dept. of CSE
Jahangirnagar University
Dhaka, Bangladesh
rifat.sadik.rs@gmail.com
Al Amin Biswas
Dept. of Computer Science and Engineering
Daffodil International University
Dhaka, Bangladesh
alaminbiswas.cse@gmail.com
Abstract—Over the last few decades, the availability and accessibility of Bangla documents and their content have rapidly increased due to rapid technological advancement. Intense research needs to be performed on various Bangla documents due to the diversity of the language and its associated sentiment. Document classification is one of the fundamental problems of Natural Language Processing. To handle misclassification and to enable convenient indexing and searching of Bangla documents on the web, researchers are nowadays exploring different fields of computer science to classify Bangla documents. In this paper, Deep Learning based approaches are implemented to classify Bangla text documents. A Convolutional Neural Network (CNN) and a Long Short Term Memory (LSTM) network are used here for the classification task. We have implemented an advanced technique that encodes the documents at the character level. Documents from three different data sources are used to validate and test the working models. The highest classification accuracy, 95.42%, is achieved on the Prothom Alo dataset using LSTM. Furthermore, we present a comparison between the two models and explain how well the classification task can be carried out using our character level approach with higher accuracy.
Keywords—Bangla Documents, Classification, CNN, LSTM
I. INTRODUCTION
With the advancement of web technology, documents are converted to web form instead of being written on paper or in hardcopy. The Internet is flooded with documents that carry vital information. Newspapers, journals, articles, and books are now accessible via the internet, and users can read them online. With this vast amount of information, documents must be categorized or indexed accurately so that users find it convenient. Categorization, retrieval, and searching are the major challenges in handling web documents. To deal with these challenges, document categorization has become a field of potential research.
Bangla is one of the most widely used languages: over 230 million people are native speakers, and another 37 million people use it as a secondary language [1]. Internet and web technology make it possible to establish an enormous number of Bangla online portals for news, education, health, politics, research, and many more. This allows us to acquire knowledge through the Bangla language. But it is a matter of great concern that very little research work has been done on document classification. Because Bangla is a complex and ancient language in terms of orthography and morphology, users find it difficult to categorize Bangla documents on the web. As a result, Bengali users are deprived of the advantages of the internet while acquiring knowledge, since it has become a global platform for sharing knowledge and expressing opinions.
Nowadays there are several document classification methods. Researchers mostly use conventional methods such as distance measurement, fuzzy inference rules, the inverse class frequency method, centroid based classification approaches, etc. [2]. But due to the complex nature of the Bangla language, most of the native approaches lack precision. So researchers have now incorporated artificial intelligence and its subsets, Machine Learning and Deep Learning, for document classification purposes [3].
In our research work, we have focused on Deep Learning approaches. The two most popular algorithms, namely the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), will be used to classify web documents into their respective categories. Since manual feature selection is not required in Deep Learning, the process becomes fast and accurate compared to traditional approaches. A robust feature extraction technique called embedding will be used. This technique will be implemented at the character level, which provides efficiency while training the model.
This paper is organized as follows: Section II, designated as Literature Review, describes the related work to identify the actual research gap among the existing works. The methodology used to accomplish this work is presented in Section III. The experimental results with analysis are presented in Section IV. Lastly, Section V concludes this research work.
II. LITERATURE REVIEW
To propose a state-of-the-art solution that is also the best in the working context, this section plays a very vital role. There are several approaches proposed by researchers incorporating different mathematical and statistical methods for document classification.
Daiki et al. [4] proposed a method where embedding was done using image-based characters, relying on a character-level CNN (CLCNN) and a wildcard training method. Based on the pictorial structure of each character, encoding was performed, and some of the inputs were treated as wildcards during training. Japanese novels from Aozora Bunko, consisting of almost 104 novels, were used in this study. The proposed method achieved an accuracy of 86.7% in classifying documents
consisting of different varieties of expressions. To classify multilingual texts, Benjamin et al. [5] proposed a method based on a character-level Convolutional Neural Network. Texts from microblogs and other data repositories such as Wikipedia and Twitter posts were used to train and test the model. This method converted strings from UTF-8 to an 8-bit sequence and thus amplified conventional character-level encoding techniques. Compared with other models, the overall results for classifying texts based on geographic location using the proposed method were satisfying in a couple of cases. Tianshi
et al. [6] proposed a text classification approach that encoded
the text using character level Convolutional Neural Network
and used an adversarial network that received textual features as input. In this study, four different corpora, AG-news, DBPedia, 20NG, and IMDB, were used as datasets. The proposed framework achieved higher accuracy and performed significantly well on both large-scale and small datasets.
A Character-based Text Classification Network (CharTeC-Net)
was proposed by Njikam et al. [7] that constructed four
different building blocks that were used for feature extraction.
Different large scale datasets were used to train the model
for the purpose of English and Chinese news categorization,
ontology classification, and sentiment analysis. The proposed
CharTeC-Net model had achieved higher accuracy on different
datasets compared to other conventional methods when dealing
with large scale data. Xiang et al. [8] proposed a study
for text classification that was modeled using Convolutional
Neural Network at the character level and the model was
applied on different datasets. Compared to other models it
had been observed that the proposed model was effective in
classification tasks.
To identify the category of a domain from a Bangla document, Ankita et al. [9] classified Bangla texts from different web sources, and model training was accomplished using the LIBLINEAR classification algorithm. For feature extraction, Term Frequency-Inverse Document Frequency (TF-IDF) and a dimensionality reduction technique were used. In classifying documents of different domains, this approach resulted in higher accuracy and outperformed other traditional methods. In another work, Ankita et al. [10] used two distance measurement techniques to classify Bangla documents. The dataset, consisting of news from different domains such as business, sports, state, medical, and science and technology, was collected from different Bangla news portals. For preprocessing and feature extraction, tokenization and vector space model creation were used. The proposed method showed a promising result, achieving an accuracy of 95.80% with the Euclidean distance measure. Haydar et
al. [11] proposed a Deep Learning model to extract sentiments
from the Bangla language. The extraction process was carried
out using RNN. This proposed process achieved an accuracy of
80% while mining sentiments from Facebook posts. A Deep Learning based text classification approach was proposed by Moqsadur et al. [12], where Bengali sports news comments were categorized based on sentiment. CNN, RNN, and Multilayer Perceptron (MLP) were used for the categorization task. Performance was measured with respect to F1 score, precision, and recall, and in every criterion CNN outperformed the others. Kabir et al. [13] applied a Stochastic Gradient Descent (SGD) classifier for Bangla document categorization, where feature mining included TF-IDF. The experiment was conducted on BDNews24 documents. The performance was measured using the F1 score, which indicated that the proposed SGD classifier achieved a score of 0.9385, higher than the other investigated methods.
III. METHODOLOGY
Most of the text classification tasks nowadays are based on
word tokenization [14]. In this paper, we use tokenization at the character level. Fig. 1 shows the overall explanation of our methodology. For each document, we first tokenize the document by characters. After tokenization, we convert the characters to sequence ids. All of the documents are then sent to the embedding layer to get a 32-dimensional embedding vector for each character via semantic analysis. On the other hand, each document class is converted to a one-hot encoding. After that, the character embedding matrix and the one-hot encoding matrix are sent to the CNN or RNN model for text classification.
Fig. 1: Character level Deep Learning for document classifica-
tion (Systematic approach including character level encoding
and model building using Deep Learning algorithms).
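The encoding pipeline described above can be sketched in a few lines. This is a minimal illustration only: the vocabulary, the padding scheme, the sequence length, and the example categories below are assumptions for demonstration, not the paper's exact settings.

```python
def build_char_vocab(documents):
    """Assign each distinct character an integer id; 0 is reserved for padding."""
    chars = sorted({ch for doc in documents for ch in doc})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode_document(doc, vocab, max_len):
    """Tokenize by characters and convert to a fixed-length sequence of ids."""
    ids = [vocab.get(ch, 0) for ch in doc[:max_len]]
    return ids + [0] * (max_len - len(ids))  # right-pad with the reserved 0 id

def one_hot(label, classes):
    """One-hot encode a class label, as done for the document categories."""
    return [1 if c == label else 0 for c in classes]

# Hypothetical documents and categories for illustration.
docs = ["sports news", "politics news"]
vocab = build_char_vocab(docs)
sequence = encode_document("news", vocab, max_len=8)
target = one_hot("sports", ["economy", "politics", "sports"])
```

The resulting id sequences would then be fed to an embedding layer, which maps each id to a dense vector (32-dimensional in this work) before the CNN or RNN.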
A. Convolutional Neural Network (CNN)
A Convolutional Neural Network [15] consists of three layer types: Convolution, Pooling, and Fully Connected (FC) layers.
Fig. 2: Character level CNN (Layered architecture of CNN consisting of an embedding layer, convolutional layers, and max-pooling layers)
The convolution layer uses filters for the detection and extraction of features. The pooling layer is applied to each feature map and reduces its spatial size; this downsampling helps to reduce heavy computation. The size of a feature map is reduced according to formula 1.
o_s = (i_s − p_s + 1) / s    (1)

where o_s, i_s, p_s, and s are the output shape, input shape, pooling size, and stride value, respectively. A fully connected (FC) layer is used for the classification part. It mixes the signals of information between each input dimension and each output class so that the classification is based on the whole input.
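Formula (1) can be expressed as a small helper; integer division is an assumption here for input lengths that do not divide evenly.

```python
def pooled_length(input_len, pool_size, stride):
    """Output length of a pooling layer per formula (1)."""
    return (input_len - pool_size + 1) // stride
```

With the size-3, stride-1 pooling used in TABLE I, for example, a length-100 feature map would shrink to length 98.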
TABLE I: CONVOLUTION AND POOLING LAYERS
USED IN OUR ARCHITECTURE.
Layer Feature Kernel Pooling Stride
1 64 7 3 1
2 64 7 3 1
3 64 3 N/A 1
4 64 3 3 1
TABLE I and Fig. 2 show the details of the different layers used in our model. Four convolution layers are used, where the first two have kernels of size 7 and the last two have kernels of size 3. There is a pooling layer of size 3 after each convolution layer except the third one. Each of the convolution layers uses a stride of size 1.
In the fully connected part of the network, there are two fully connected (dense) layers of 256 neurons each. After each fully connected layer, a dropout layer with a rate of 0.5 is added to control overfitting. Finally, an output (dense) layer with a softmax activation function is used to produce the output.
B. Long Short Term Memory (LSTM)
Recurrent Neural Networks [16] are very efficient at handling long-distance dependencies. In our method, we used Long Short Term Memory (LSTM), which is a special type of RNN [17]. There are several kinds of gates in an LSTM cell. Each gate processes the incoming data differently and updates its memory.
Fig. 3 illustrates the basic architecture of an LSTM cell. Here, h_{t-1} is the output of the previous LSTM cell and x_t is the new input, which is concatenated with h_{t-1}.
Fig. 3: LSTM cell (Architecture of an LSTM cell consisting of three gates, namely the input, forget, and output gates, and activation functions)
At first, x_t and h_{t-1} are sent to every gate. The output obtained from each gate can be expressed by the following formulas 2, 3, 4, and 5 [18].

g = tanh(x_t W_x^g + h_{t-1} W_h^g + b_g)    (2)
i = sigmoid(x_t W_x^i + h_{t-1} W_h^i + b_i)    (3)
f = sigmoid(x_t W_x^f + h_{t-1} W_h^f + b_f)    (4)
o = sigmoid(x_t W_x^o + h_{t-1} W_h^o + b_o)    (5)

where W_x is the weight for the input, W_h is the weight for the previous cell output, and b is the bias. Finally, the output of the LSTM cell is calculated by formulas 6 and 7.

s_t = g * i + s_{t-1} * f    (6)
h_t = tanh(s_t) * o    (7)

Here, * denotes element-wise multiplication.
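One step of this cell update can be sketched numerically. The sketch below uses scalar inputs and hypothetical weights (a real cell uses weight matrices and vectors), with the standard LSTM gating: sigmoid for the input, forget, and output gates and tanh for the candidate value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W_x, W_h, b):
    """One LSTM cell step following the gate equations above (scalar toy version)."""
    g = math.tanh(x_t * W_x["g"] + h_prev * W_h["g"] + b["g"])  # candidate value
    i = sigmoid(x_t * W_x["i"] + h_prev * W_h["i"] + b["i"])    # input gate
    f = sigmoid(x_t * W_x["f"] + h_prev * W_h["f"] + b["f"])    # forget gate
    o = sigmoid(x_t * W_x["o"] + h_prev * W_h["o"] + b["o"])    # output gate
    s_t = g * i + s_prev * f       # updated cell state
    h_t = math.tanh(s_t) * o       # cell output
    return h_t, s_t

# Hypothetical unit weights and zero biases, run for two steps.
W = {"g": 1.0, "i": 1.0, "f": 1.0, "o": 1.0}
b = {"g": 0.0, "i": 0.0, "f": 0.0, "o": 0.0}
h1, s1 = lstm_step(0.0, 0.0, 0.0, W, W, b)
h2, s2 = lstm_step(1.0, h1, s1, W, W, b)
```

With a zero input and zero initial state, the cell output stays at zero; a nonzero input produces a bounded output because of the tanh and sigmoid squashing.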
In our method, two LSTM cells are used sequentially (Fig. 4). In the first LSTM, 128 neurons are used, and 64 neurons are used in the second LSTM. After each LSTM cell, a dropout layer with a rate of 0.5 is added to reduce overfitting. A fully connected layer is placed after the LSTM layers. A softmax function is finally used to produce the output.
Fig. 4: Character level LSTM (Layered architecture including embedding layer, LSTM cells, and activation functions).
IV. RESULTS AND ANALYSIS
A. Dataset
In our experiment, we have chosen three different datasets
to train and validate our models. The datasets are BARD [19], Prothom Alo [20], and the Open Source Bengali Corpus (OSBC) dataset [21]. Among the three, Prothom Alo is collected by web scraping and the other two are open-source datasets. All these data repositories contain different numbers of Bangla documents. Some data have been removed from these datasets if they contain any character that does not represent the Bangla language. To measure the performance efficiently, we have used three different sizes of data from each of these datasets. The overall datasets and their splitting are presented in TABLE II. Furthermore, we have partitioned the selected datasets into a ratio of 80 to 20 for training and testing purposes.
TABLE II: OVERALL DATASET SPLITTING.
Name BARD Prothom Alo OSBC
Training data 40448 103008 63036
Testing data 10112 25753 15760
Total data 50560 128761 78796
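The 80/20 partition described above can be sketched as follows, on a hypothetical list of labeled samples (the real datasets hold the document counts shown in TABLE II, and the paper does not specify its shuffling procedure).

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle and split labeled samples into train/test portions."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# 100 hypothetical (document, label) pairs across 5 classes.
data = [("doc_%d" % i, i % 5) for i in range(100)]
train_set, test_set = split_dataset(data)
```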
B. Experiments
We trained our models using an Nvidia GeForce 2080 Super GPU. During training, we used a batch size of 8 and 100 epochs for every experiment. We used cross-entropy as the loss function and Adam as the optimizer, with an initial learning rate of 0.001, to train the models.
C. Results
Accuracy and F1 score parameters are computed to analyze the performance of the working models. Accuracy and F1 score are calculated using the following formula set [22] (formulas 8, 9, 10, and 11).

Precision = TP / (TP + FP)    (8)
Recall = TP / (TP + FN)    (9)
Accuracy = (TP + TN) / (TP + FP + FN + TN)    (10)
F1 = (2 × Precision × Recall) / (Precision + Recall)    (11)
Fig. 5: Accuracy and loss of CNN over epochs (a: accuracy of CNN over epochs; b: loss of CNN over epochs).
Where TP, FP, TN, FN are the true positive, false positive,
true negative, and false negative respectively.
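Formulas (8) through (11) compute directly from these confusion-matrix counts; the example counts below are hypothetical.

```python
def precision(tp, fp):
    return tp / (tp + fp)                      # formula (8)

def recall(tp, fn):
    return tp / (tp + fn)                      # formula (9)

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + fn + tn)     # formula (10)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)                 # formula (11)

# Hypothetical counts for one class: tp=90, fp=10, tn=85, fn=15.
```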
Fig. 5 and Fig. 6 depict the validation accuracy and loss of the CNN and LSTM models for all the working datasets over the epochs, while TABLE III shows the models' final accuracy and F1 scores.
D. Discussions and Findings
From Fig. 5, it is observed that CNN performs well in classification on the Prothom Alo dataset. The
TABLE III: EXPERIMENTAL RESULT COMPARISON OF CNN AND LSTM
BARD Prothom Alo OSBC
Models Accuracy F1 Accuracy F1 Accuracy F1
CNN 92.85% 91.69% 94.44% 91.08% 78.10% 71.59%
LSTM 92.41% 90.76% 95.42% 92.57% 82.06% 78.07%
Fig. 6: Accuracy and loss of LSTM over epochs (a: accuracy of LSTM over epochs; b: loss of LSTM over epochs).
observed accuracy of CNN for the Prothom Alo dataset is higher than for the other datasets, and the validation loss of CNN for the Prothom Alo dataset is comparatively lower than for the other datasets. Fig. 6 demonstrates the validation accuracy and loss for the LSTM model. In terms of accuracy and loss, it is also observed that LSTM, like CNN, performs well on the Prothom Alo dataset.
V. CONCLUSION
In this paper, a character level approach is implemented to categorize Bangla text documents. Two well-known Deep Learning models, namely CNN and LSTM, are used for the classification task. Here, 80% of the data from each dataset is used for model training, and the remaining 20% is used for testing. The presented character level encoding scheme improves the accuracy of the classification task. This scheme first tokenizes the documents at the character level and generates a sequence of character identities on which character embedding is applied. The overall performance of the LSTM model is satisfactory on all three datasets. The highest accuracy and F1 score, 95.42% and 92.57%, respectively, are obtained on the Prothom Alo dataset using the LSTM model. In future studies, we will try to build more advanced and robust hybrid models by integrating different Deep Learning architectures to classify documents. We will also explore Bi-directional LSTM and BERT models for categorizing Bangla text documents.
REFERENCES
[1] ”A language of Bangladesh”, Last Accessed: July 29,2020. [Online].
Available: https://www.ethnologue.com/language/ben
[2] A. Bilski, "A review of artificial intelligence algorithms in document classification," International Journal of Electronics and Telecommunications, vol. 57, pp. 263–270, 2011.
[3] Y. Juhn and H. Liu, "Artificial intelligence approaches using natural language processing to advance EHR-based clinical research," Journal of Allergy and Clinical Immunology, vol. 145, no. 2, pp. 463–469, 2020.
[4] D. Shimada, R. Kotani, and H. Iyatomi, “Document classification
through image-based character embedding and wildcard training,” in
2016 IEEE International Conference on Big Data (Big Data). IEEE,
2016, pp. 3922–3927.
[5] B. Adams and G. McKenzie, “Crowdsourcing the character of a place:
Character-level convolutional networks for multilingual geographic text
classification,” Transactions in GIS, vol. 22, no. 2, pp. 394–408, 2018.
[6] T. Wang, L. Liu, H. Zhang, L. Zhang, and X. Chen, “Joint character-level
convolutional and generative adversarial networks for text classification,”
Complexity, vol. 2020, 2020.
[7] A. N. Samatin Njikam and H. Zhao, “Chartec-net: An efficient and
lightweight character-based convolutional network for text classifica-
tion,” Journal of Electrical and Computer Engineering, vol. 2020, 2020.
[8] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[9] A. Dhar, N. Dash, and K. Roy, “Classification of text documents through
distance measurement: An experiment with multi-domain bangla text
documents,” in 2017 3rd International Conference on Advances in
Computing, Communication & Automation (ICACCA)(Fall). IEEE,
2017, pp. 1–6.
[10] A. Dhar, N. S. Dash, and K. Roy, "Application of TF-IDF feature for categorizing documents of online Bangla web text corpus," in Intelligent Engineering Informatics. Springer, 2018, pp. 51–59.
[11] M. S. Haydar, M. Al Helal, and S. A. Hossain, “Sentiment extraction
from bangla text: A character level supervised recurrent neural network
approach,” in 2018 International Conference on Computer, Communica-
tion, Chemical, Material and Electronic Engineering (IC4ME2). IEEE,
2018, pp. 1–4.
[12] M. Rahman, S. Haque, and Z. R. Saurav, “Identifying and categorizing
opinions expressed in bangla sentences using deep learning technique,”
International Journal of Computer Applications, vol. 975, p. 8887.
[13] F. Kabir, S. Siddique, M. R. A. Kotwal, and M. N. Huda, "Bangla text document categorization using stochastic gradient descent (SGD) classifier," in 2015 International Conference on Cognitive Computing and Information Processing (CCIP). IEEE, 2015, pp. 1–4.
[14] J. L. Fagan, M. D. Gunther, P. D. Over, G. Passon, C. C. Tsao, A. Zamora, and E. M. Zamora, "Method for language-independent text tokenization using a character categorization," Google Patents, Feb. 5, 1991, US Patent 4,991,094.
[15] Y. Li, Z. Hao, and H. Lei, "Survey of convolutional neural network," Journal of Computer Applications, vol. 36, no. 9, pp. 2508–2515, 2016.
[16] G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of recurrent-
neural-network architectures and learning methods for spoken language
understanding,” in Interspeech, 2013, pp. 3771–3775.
[17] A. Sherstinsky, “Fundamentals of recurrent neural network (rnn) and
long short-term memory (lstm) network,” Physica D: Nonlinear Phe-
nomena, vol. 404, p. 132306, 2020.
[18] S. Ghosh, O. Vinyals, B. Strope, S. Roy, T. Dean, and L. Heck, "Contextual LSTM (CLSTM) models for large scale NLP tasks," arXiv preprint arXiv:1602.06291, 2016.
[19] M. T. Alam and M. M. Islam, "BARD: Bangla article classification using a new comprehensive dataset," in 2018 International Conference on Bangla Speech and Language Processing (ICBSLP). IEEE, 2018, pp. 1–5.
[20] ”Prothom Alo”, Last Accessed: August 14, 2020. [Online]. Available:
https://www.prothomalo.com/
[21] ”Bangla Dataset (Corpus)”, Last Accessed: August 14, 2020. [Online].
Available: https://scdnlab.com/corpus/
[22] N. D. Marom, L. Rokach, and A. Shmilovici, “Using the confusion
matrix for improving ensemble classifiers,” in 2010 IEEE 26-th Con-
vention of Electrical and Electronics Engineers in Israel. IEEE, 2010,
pp. 000 555–000 559.
... But the proposed system was evaluated using only one self-build dataset [8]. Rahman et al. [13] developed a Bengali text classification system using CNN and LSTM with character level embedding and attained the maximum accuracy of 92.41% for BARD datasets. They used only a 20% data sample for training and testing purposes and character level embedding not containing the sentence level semantic/syntactic features. ...
... The automatic embedding hyperparameters tuning method achieved better performance compared to manual tuning because the manual tuning does not consider all combinations of ED, CW and intrinsic evaluation. Extrinsic Performance: Now, The best performing embedding model is AEHT-GloVe and four classification methods, e.g., CNNs [6], LSTM [13], BiL-STM [9] and SGD [3] used to measure the classification performance (e.g., extrinsic performance). The classification methods hyperparameters has taken from the previous studies. ...
... Model Name Accuracy (%) Word2Vec+SGD [3] 79.95 BARD Word2Vec+BiLSTM [9] 92.05 Char-embedding+LSTM [13] 91.10 Proposed (AEHT-GloVe+CNNs) 95.16 Word2Vec+SGD [3] 70.28 IndicNLP ...
Chapter
In the last few years, an enormous amount of unstructured text documents has been added to the World Wide Web because of the availability of electronics gadgets and increases the usability of the Internet. Using text classification, this large amount of texts are appropriately organized, searched, and manipulated by the high resource language (e.g., English). Nevertheless, till now, it is a so-called issue for low-resource languages (like Bengali). There is no usable research and has conducted on Bengali text classification owing to the lack of standard corpora, shortage of hyperparameters tuning method of text embeddings and insufficiency of embedding model evaluations system (e.g., intrinsic and extrinsic). Text classification performance depends on embedding features, and the best embedding hyperparameter settings can produce the best embedding feature. The embedding model default hyperparameters values are developed for high resource language, and these hyperparameters settings are not well performed for low-resource languages. The low-resource hyperparameters tuning is a crucial task for the text classification domain. This study investigates the influence of embedding hyperparameters on Bengali text classification. The empirical analysis concludes that an automatic embedding hyperparameter tuning (AEHT) with convolutional neural networks (CNNs) attained the maximum text classification accuracy of 95.16 and 86.41% for BARD and IndicNLP datasets.KeywordsNatural language processingLow-resource text classificationHyperparameters tuningEmbeddingFeature extraction
... Bengali's complex orthography and morphology make it challenging for users to classify documents on the internet. Since the web has developed into a global forum for information sharing and opinion exchange, Bengali users are deprived the benefits of using it for education [2]. ...
... Confusion matrix, Precision, Recall, Accuracy and F1 score parameters are computed to analyze the performance of each ML and DL classifier. Accuracy and F1 score are calculated using following TP, FP, TN, FN are the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) respectively [2]. ...
Conference Paper
Full-text available
Bengali is the world's sixth-most frequently used dialect. Natural language processing (NLP), which enables data scientists to extract useful information from the constantly growing quantity of text data available on the net, includes text categorization as a key component. It is the act of organizing already-created groups or classes of text documents. But compared to other languages, like English, Bengali does not have as many capabilities for classifying texts. This study offers an extensive analysis of the current categorization of Bengali documents. Also gives a comprehensive summary of recent advances in ML and DL, covering datasets, methods, performance, accuracy rate, classification category, strengths, and limits. The purpose of this discussion is to identify current challenges that must be addressed and recommend some guidelines for future studies that might be explored.
... Few studies have been conducted on text classification in Bengali in recent years (Rahman et al., 2020a;Alam and Islam, 2018;Rahman et al., 2020b;Hossain et al., 2021b). Most of these studies used machine learning and deep-learning techniques evaluated on the developed datasets (Alam and Islam, 2018;Hossain and Hoque, 2018;Hossain et al., 2021b), whereas few works evaluated using the publicly available datasets (Rahman et al., 2020a,b). ...
... Due to the unavailability of other research datasets, contemporary techniques were applied to the four different corpora for comparison. (Rahman et al., 2020b). The M-BERT required the highest training time (e.g., training convergence time) of 29 h and 56 min Rahman et al. (2020a), whereas the AVG-M+CNN consumed a minimum training time (23 min). ...
Article
This paper proposes an intelligent text classification framework for a resource-constrained language like Bengali, which is considered a challenging task due to the lack of standard corpora, appropriate hyper-parameter tuning method, and pre-trained language-specific embedding. The proposed framework comprises an average meta-embedding feature fusion module and a convolutions neural network module called AVG-M+CNN. This work also proposes an algorithm, i.e., automatic hyperparameter tuning and selection, for enhancing the performance of the AVG-M+CNN technique. All meta-embedding models are evaluated using the intrinsic, e.g., semantic, syntactic, relatedness word similarity, analogy tasks and extrinsic evaluators. The intrinsic evaluator evaluates 200 Bengali semantic, syntactic and relatedness word pairs. Spearman (̂), Pearson (̂) and cosine similarity correlations are used to evaluate 18 individual embedding and 9 meta-embedding models. The 3COSADD and 3COSMUL evaluators evaluate the 300 analogy tasks. The extrinsic evaluator evaluates a total of 156 classification models on four corpora: BARD, IndicNLP, Prothom-Alo and 11 (a newly developed corpus having eleven distinct categories). Among these, the AVG-M+CNN model achieves the highest accuracy regarding four Bengali corpora: 95.92±.001% for BARD, 93.10±.001% for Prothom-Alo, 90.07±.001% for 11 and 87.44±.001% for IndicNLP, respectively.
... The harmonic mean of the model's precision and recall is known as the f1-score. Accuracy is the ratio between the number of correct predictions and total predictions [13]. Accuracy value ranges from 0 to 1, with 1 being the best. ...
Conference Paper
Full-text available
Massive digital texts are now accessible, thanks to technological advancement. Any amount of disorganized writing is useless. A high-quality representative corpus of any particular language is essential for research in computational linguistics and natural language processing (NLP). Bangla NLP research is still in its infancy because of the dearth of high-quality public corpus. This paper proposed a newly produced corpus consists of 1,30,307 documents covering 10 categories collected from 11 websites, having 2,94,80,828 tokens and 17,59,085 unique tokens. Seven supervised machine learning methods are explored in this work. Furthermore, Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive explanations (SHAP) are also examined to explain about different model performance. The obtained results show that the Random Forest (RF), Decision Tree (DT) and Support Vector Machine (SVM) outperform other models. RF classifier achieves the highest accuracy 99.91% which is better than the existing state-of-the-art methods.
... Rahman et al [3], implementan un enfoque a nivel de caracteres que categoriza noticias bengalís. Utilizan dos modelos de deep Lear Ning CNN y LSTM, respectivamente, para la clasificación, usando la ley de Paretto. ...
Article
Full-text available
El presente proyecto consiste en desarrollar un modelo de Procesamiento del Lenguaje Natural para clasificar noticias utilizando un conjunto de datos o DataSets ya evaluados. El objetivo principal es crear un sistema que pueda identificar y asignar automáticamente las noticias a una de las categorías predefinidas: negocios, entretenimiento, política, deportes o tecnología. Esto implica el preprocesamiento de datos, extracción de características, entrenamiento de un modelo de machine learning y posteriormente su evaluación de rendimiento utilizando métricas como” precisión”,” recall 2” F1 − score”. Esto permitir ‘a determinar que tan bien el modelo puede predecir la categoría correcta para una noticia nueva o no etiquetada. Si el rendimiento del modelo es satisfactorio, se puede utilizar para clasificar noticias no etiquetadas en tiempo real. En resumen, se busca proporcionar una solución eficiente y precisa para organizar y etiquetar el contenido informativo de una noticia con ayuda de la Inteligencia Artificial.
Chapter
Parts-of-speech (POS) tagging is considered one of the most challenging fields in natural language processing (NLP). The objective of this research is to develop a POS tagger for the Assamese language. Due to the scarcity of digital linguistic resources, Assamese lacks high-performing POS taggers. To fill this gap, long short-term memory (LSTM) and bidirectional long short-term memory (Bi-LSTM) models are explored in the proposed research to develop a POS tagger using an Assamese POS corpus. As anticipated, this experiment faced difficulties in understanding and managing natural language for computational linguistics. The Assamese corpus considered in this research comprises around 50,000 words. At the initial stage, while examining the first set of data of about 20,000 words, it was noticed that the taggers yielded satisfactory results. Based on this result, the Assamese corpus was enlarged to 50,000 words, and better performance is noted in terms of accuracy, precision, recall, and F1-score. As a result, an accuracy of 91.20% is achieved for LSTM and 91.72% for Bi-LSTM. In the interest of substantial NLP research on the Assamese language and for comparative purposes, a comparison between existing Assamese POS taggers and the proposed work is also presented.
Preprint
Full-text available
The selection of features for text classification is a fundamental task in text mining and information retrieval. Despite being the sixth most widely spoken language in the world, Bangla has received little attention due to the scarcity of text datasets. In this research, we collected, annotated, and prepared a comprehensive dataset of 212,184 Bangla documents in seven different categories and made it publicly accessible. We implemented three deep learning generative models to extract text features: the LSTM variational autoencoder (LSTM VAE), the auxiliary classifier generative adversarial network (AC-GAN), and the adversarial autoencoder (AAE), models whose applications were initially found in the field of computer vision. We utilized our dataset to train these three models and used the resulting feature space in the document classification task. We evaluated the performance of the classifiers and found that the adversarial autoencoder model produced the best feature space.
Book
Focuses on the research trends, challenges, and future of artificial intelligence
Chapter
Document set identification assigns a text to its predefined text set. Therefore, the objective of any classification work is to create a model that can categorise various texts and objects into distinct classes. In this paper, three consequential deep learning based models, viz. BiLSTM, BiLSTM with an attention layer, and CNN-BiLSTM, have been used, which have auto-learning capability on Bengali corpora. The CNN-BiLSTM model has been designated the classification model for its best performance in categorising Bengali text documents. This model, named BEN-CNN-BiLSTM, was trained on Bengali text documents to determine the category of an unknown Bengali document. At first, more than four lakh (400,000) news articles from renowned Bengali newspapers are processed. After that, the training data is processed and fed into the proposed model. Finally, the model performance is assessed using the test dataset to calculate recall, precision, F-score, and accuracy. Compared to other standard classification algorithms in Bengali text classification, the proposed BEN-CNN-BiLSTM model achieved 93.94% accuracy. Thus, it can be said that the proposed BEN-CNN-BiLSTM model can serve as a new document set identification technique for Bengali datasets.
Article
Full-text available
This paper introduces an extremely lightweight (with only around two hundred thousand parameters) and computationally efficient CNN architecture, named CharTeC-Net (Character-based Text Classification Network), for character-based text classification problems. This new architecture is composed of four building blocks for feature extraction. Each of these building blocks, except the last one, uses 1 × 1 pointwise convolutional layers to add more nonlinearity to the network and to increase the dimensions within each building block. In addition, shortcut connections are used in each building block to facilitate the flow of gradients over the network, but more importantly to ensure that the original signal present in the training data is shared across each building block. Experiments on eight standard large-scale text classification and sentiment analysis datasets demonstrate CharTeC-Net’s superior performance over baseline methods and yield competitive accuracy compared with state-of-the-art methods, even though CharTeC-Net has only between 181,427 and 225,323 parameters and weighs less than 1 megabyte.
Article
Full-text available
With the continuous renewal of text classification rules, text classifiers need more powerful generalization ability to process the datasets with new text categories or small training samples. In this paper, we propose a text classification framework under insufficient training sample conditions. In the framework, we first quantify the texts by a character-level convolutional neural network and input the textual features into an adversarial network and a classifier, respectively. Then, we use the real textual features to train a generator and a discriminator so as to make the distribution of generated data consistent with that of real data. Finally, the classifier is cooperatively trained by real data and generated data. Extensive experimental validation on four public datasets demonstrates that our method significantly performs better than the comparative methods.
Article
Full-text available
Identifying and categorizing opinions in a sentence is the most prominent branch of natural language processing. It deals with text classification to determine the intention of the author of the text. The intention can be the expression of happiness, sadness, patriotism, disgust, advice, etc. Most of the research work on opinion or sentiment analysis is in the English language. The Bengali corpus is increasing day by day. A large number of online news portals publish their articles in the Bengali language, and a few news portals have a comment section that allows people to express their opinions. Here, research has been conducted on Bengali sports news comments published in different newspapers to train a deep learning model able to categorize a comment according to its sentiment. Comments are collected and separated based on their immanent sentiment.
Article
Full-text available
Because of their effectiveness in broad practical applications, LSTM networks have received a wealth of coverage in scientific journals, technical blogs, and implementation guides. However, in most articles, the inference formulas for the LSTM network and its parent, RNN, are stated axiomatically, while the training formulas are omitted altogether. In addition, the technique of “unrolling” an RNN is routinely presented without justification throughout the literature. The goal of this tutorial is to explain the essential RNN and LSTM fundamentals in a single document. Drawing from concepts in Signal Processing, we formally derive the canonical RNN formulation from differential equations. We then propose and prove a precise statement, which yields the RNN unrolling technique. We also review the difficulties with training the standard RNN and address them by transforming the RNN into the “Vanilla LSTM”¹ network through a series of logical arguments. We provide all equations pertaining to the LSTM system together with detailed descriptions of its constituent entities. Albeit unconventional, our choice of notation and the method for presenting the LSTM system emphasizes ease of understanding. As part of the analysis, we identify new opportunities to enrich the LSTM system and incorporate these extensions into the Vanilla LSTM network, producing the most general LSTM variant to date. The target reader has already been exposed to RNNs and LSTM networks through numerous available resources and is open to an alternative pedagogical approach. A Machine Learning practitioner seeking guidance for implementing our new augmented LSTM model in software for experimentation and research will find the insights and derivations in this treatise valuable as well.
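For reference, the inference equations that the tutorial derives for the "Vanilla LSTM" take the standard textbook form (gates $f_t$, $i_t$, $o_t$, candidate cell $\tilde{c}_t$, cell state $c_t$, hidden state $h_t$, with $\odot$ denoting elementwise multiplication); these are the conventional equations, not a quotation of the tutorial's own notation:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\!\left(c_t\right)
\end{aligned}
```

The additive update of $c_t$ is what mitigates the vanishing-gradient difficulty the tutorial discusses for the standard RNN.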
Conference Paper
Full-text available
In the literature, automated Bangla article classification has been studied, and several supervised learning models have been proposed that utilize a large textual data corpus. Although comprehensive textual datasets are available for many languages, only a few small datasets have been curated for the Bangla language. As a result, few works address the Bangla document classification problem, and due to the lack of enough training data, these approaches could not learn sophisticated supervised learning models. In this work, we curated a large dataset of Bangla articles from different news portals, which contains around 376,226 articles. This huge, diverse dataset helps us train several supervised learning models by utilizing a set of sophisticated textual features, such as word embeddings and TF-IDF. Our learning models show promising performance on our curated dataset compared to state-of-the-art works in Bangla article classification.
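Of the textual features this abstract names, TF-IDF is the simplest to make concrete: each term's weight is its in-document frequency scaled down by how many documents contain it. The sketch below is a toy pure-Python illustration of that weighting (using the plain log(N/df) variant), not the feature extractor the authors actually used.

```python
import math
from collections import Counter

# Toy TF-IDF sketch: weight(term, doc) = tf(term, doc) * log(N / df(term)).
# A real pipeline would use a library vectorizer with smoothing/normalization.

def tfidf(docs):
    """Return one {term: tf-idf weight} dict per whitespace-tokenized document."""
    n = len(docs)
    tokenized = [doc.split() for doc in docs]
    df = Counter()                       # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

w = tfidf(["good news article", "good sports article", "good sports result"])
# "good" occurs in every document, so log(N/df) = log(1) = 0 and its weight vanishes,
# while document-specific terms like "news" keep a positive weight.
```

This is exactly why TF-IDF features discriminate between news categories: category-specific vocabulary is up-weighted, ubiquitous words are suppressed.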
Conference Paper
Full-text available
Over recent years, people have become heavily involved in the virtual world to express their opinions and feelings. Each second, hundreds of thousands of data points are gathered on social media sites. Extracting information from these data and finding their sentiments is known as sentiment analysis. Sentiment analysis (SA) is an autonomous text summarization and analysis system. It is one of the most active research areas in the field of NLP and is also widely studied in data mining, web mining, and text mining. The significance of sentiment analysis is growing day by day due to its direct impact on various businesses. However, it is not so straightforward to extract sentiments when it comes to the Bangla language because of its complex grammatical structure. In this paper, a deep learning model was developed to train on the Bangla language and mine the underlying sentiments. A critical analysis was performed to compare different deep learning models across different representations of words. The main idea is to represent a Bangla sentence based on characters and extract information from the characters using a Recurrent Neural Network (RNN). The extracted information is decoded as positive, negative, or neutral sentiment.
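The character-based sentence representation described above amounts to mapping each character to an index over a fixed alphabet and one-hot encoding it before the sequence is fed to the RNN. A minimal sketch of that encoding step follows; the tiny Latin alphabet is a hypothetical stand-in for the Bangla character set, and this is an illustration of the general technique, not the paper's exact preprocessing.

```python
# Character-level one-hot encoding sketch: each character becomes a
# vector over a fixed alphabet; out-of-alphabet characters stay all-zero.
# The Latin alphabet here is a stand-in for the Bangla character set.

ALPHABET = "abcdefghijklmnopqrstuvwxyz "
CHAR_TO_IDX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(text):
    """Return a list of one-hot vectors, one per character of `text`."""
    vectors = []
    for ch in text.lower():
        vec = [0] * len(ALPHABET)
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:            # unknown chars remain zero vectors
            vec[idx] = 1
        vectors.append(vec)
    return vectors

seq = encode("ab!")
# 3 characters -> 3 vectors of length 27; '!' is out-of-alphabet (all zeros)
```

The resulting sequence of vectors is what the RNN consumes one timestep at a time.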
Article
Full-text available
This article presents a new character-level convolutional neural network model that can classify multilingual text written using any character set that can be encoded with UTF-8, a standard and widely used 8-bit character encoding. For geographic classification of text, we demonstrate that this approach is competitive with state-of-the-art word-based text classification methods. The model was tested on four crowdsourced data sets made up of Wikipedia articles, online travel blogs, Geonames toponyms, and Twitter posts. Unlike word-based methods, which require data cleaning and pre-processing, the proposed model works for any language without modification and with classification accuracy comparable to existing methods. Using a synthetic data set with introduced character-level errors, we show it is more robust to noise than word-level classification algorithms. The results indicate that UTF-8 character-level convolutional neural networks are a promising technique for georeferencing noisy text, such as found in colloquial social media posts and texts scanned with optical character recognition. However, word-based methods currently require less computation time to train, so are currently preferable for classifying well-formatted and cleaned texts in single languages.
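The key property this abstract relies on is that UTF-8 turns text in any script into a sequence over a fixed 256-symbol byte alphabet, with no language-specific preprocessing. A minimal sketch of that quantization step (the fixed length and zero-padding convention are illustrative assumptions, not the paper's exact settings):

```python
# UTF-8 byte-level quantization sketch: any string, in any script,
# becomes a sequence of integers in 0..255 suitable as CNN input.
# Sequences are truncated or zero-padded to a fixed length.

def to_bytes(text, max_len=16):
    """Encode text as a fixed-length list of UTF-8 byte values."""
    data = list(text.encode("utf-8"))[:max_len]
    return data + [0] * (max_len - len(data))

ascii_seq = to_bytes("abc")     # ASCII characters take 1 byte each
accented_seq = to_bytes("é")    # non-ASCII characters expand to several bytes
```

Because every input, regardless of language, is mapped through this same byte alphabet, the downstream CNN needs no tokenizer, vocabulary, or cleaning step, which is what makes the approach robust to noisy text.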
Article
The wide adoption of electronic health record systems (EHRs) in health care generates big real-world data that opens new venues to conduct clinical research. As a large amount of valuable clinical information is locked in clinical narratives, natural language processing (NLP) techniques as an artificial intelligence approach have been leveraged to extract information from clinical narratives in EHRs. This capability of NLP potentially enables automated chart review for identifying patients with distinctive clinical characteristics in clinical care and reduces methodological heterogeneity in defining phenotype obscuring biological heterogeneity in research concerning allergy, asthma, and immunology. This brief review discusses the current literature on the secondary use of EHR data for clinical research concerning allergy, asthma, and immunology and highlights the potential, challenges, and implications of NLP techniques.