Bangla Document Classification using Character
Level Deep Learning
Md. Mahbubur Rahman
Crowd Realty
Tokyo, Japan
mahbuburrahman2111@gmail.com
Rifat Sadik
Dept. of CSE
Jahangirnagar University
Dhaka, Bangladesh
rifat.sadik.rs@gmail.com
Al Amin Biswas
Dept. of Computer Science and Engineering
Daffodil International University
Dhaka, Bangladesh
alaminbiswas.cse@gmail.com
Abstract—Over the last few decades, the availability and accessibility of Bangla documents and their content have increased rapidly due to technological advancement. Intense research needs to be performed on various Bangla documents due to the diversity of the language and its associated sentiment. Document classification is one of the fundamental problems of Natural Language Processing. To handle misclassification and to enable convenient indexing and searching of Bangla documents on the web, researchers are nowadays exploring different fields of computer science to classify Bangla documents. In this paper, Deep Learning based approaches are implemented to classify Bangla text documents. A Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network are used for the classification task. We have implemented an advanced technique that encodes the documents at the character level. Documents from three different data sources are used to validate and test the working models. The highest classification accuracy, 95.42%, is achieved on the Prothom Alo dataset using LSTM. Furthermore, we present a comparison between the two models and explain how well the classification task can be carried out using our character-level approach with higher accuracy.
Keywords—Bangla Documents, Classification, CNN, LSTM
I. INTRODUCTION
With the advancement of web technology, documents are converted to web form instead of being written on paper or in hard copy. The Internet is flooded with documents that carry vital information. Newspapers, journals, articles, and books are now accessible via the internet, and users can read them online. With this vast amount of information, documents must be categorized or indexed accurately so that users find them convenient to access. Categorization, retrieval, and searching are the major challenges in handling web documents. To deal with this problem, document categorization has become a field of potential research.
Bangla is one of the most widely used languages: over 230 million people are native speakers, and another 37 million people use it as a secondary language [1]. Internet and web technology have made it possible to establish an enormous number of Bangla online portals for news, education, health, politics, research, and many more. This allows us to acquire knowledge in the Bangla language. But it is a matter of great concern that very little research has been done on Bangla document classification. Since Bangla is a complex and ancient language in terms of orthography and morphology, users find it difficult to categorize Bangla documents on the web. As a result, Bengali users are deprived of the advantages of the internet while acquiring knowledge, even though it has become a global platform for sharing knowledge and expressing opinions.
Nowadays there are several document classification methods. Researchers have used conventional methods such as distance measurement, fuzzy inference rules, the inverse class frequency method, centroid-based classification, etc. [2]. But due to the complex nature of the Bangla language, most of these native approaches lack precision. So researchers have now turned to artificial intelligence and its subsets, Machine Learning and Deep Learning, for document classification purposes [3].
In our research work, we have focused on Deep Learning approaches. The two most popular algorithms, namely the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), are used to classify web documents into their respective categories. Since manual feature selection is not required in Deep Learning, the process becomes fast and accurate compared to traditional approaches. A robust feature extraction technique called embedding is used. This technique is implemented at the character level, which provides efficiency while training the model.
This paper is organized as follows: Section II describes the related work to identify the research gap among the existing work and is designated as the Literature Review. The methodology used to accomplish this work is presented in Section III. The experimental results with analysis are presented in Section IV. Lastly, Section V concludes this research work.
II. LITERATURE REVIEW
To propose a state-of-the-art solution that is also the best in the working context, this section plays a vital role. There are several approaches proposed by researchers incorporating different mathematical and statistical methods for document classification.
Daiki et al. [4] proposed a method where embedding was done using image-based characters, relying on a character-level CNN (CLCNN) and a wildcard training method. Encoding was based on the pictorial structure of each character, and some of the inputs were treated as wildcards during training. Japanese novels from Aozora Bunko, consisting of almost 104 novels, were used in this study. The proposed method achieved an accuracy of 86.7% in classifying documents consisting of different varieties of expressions.
multilingual Benjamin at el. [5] proposed a method that is
based on character level Convolutional Neural Network. Texts
from microblogs and other data repositories such as Wikipedia,
Tweeter posts were used to train and test the model. This
method converted stings from UTF-8 to an 8-bit sequence
and thus amplified the conventional character level encoding
techniques. Comparing with other models, the overall results
for classifying texts based on geographic location using the
proposed method were satisfying in a couple of cases. Tianshi
et al. [6] proposed a text classification approach that encoded
the text using character level Convolutional Neural Network
and used an adversarial network that received input as a
textual feature. In this study, four different corpora from
AG-news, DBPedia, 20NG, IMDB were used as datasets.
The proposed framework had achieved higher accuracy and
performed significantly on both large-scale and small datasets.
A Character-based Text Classification Network (CharTeC-Net) was proposed by Njikam et al. [7], constructed from four different building blocks used for feature extraction. Different large-scale datasets were used to train the model for English and Chinese news categorization, ontology classification, and sentiment analysis. The proposed CharTeC-Net model achieved higher accuracy on different datasets compared to other conventional methods when dealing with large-scale data. Xiang et al. [8] proposed a study for text classification modeled using a Convolutional Neural Network at the character level, and the model was applied to different datasets. Compared to other models, it was observed that the proposed model was effective in classification tasks.
To identify the category of a domain from a Bangla document, Ankita et al. [9] classified Bangla texts from different web sources, and model training was accomplished using the LIBLINEAR classification algorithm. For feature extraction, Term Frequency-Inverse Document Frequency (TF-IDF) and a dimensionality reduction technique were used. In classifying documents from different domains, this approach resulted in higher accuracy and outperformed other traditional methods. In another work, Ankita et al. [10] used two distance measurement techniques to classify Bangla documents. The dataset, consisting of news from different domains such as business, sports, state, medical, and science and technology, was collected from different Bangla news portals. For preprocessing and feature extraction, tokenization and vector space model creation were used. The proposed method showed a promising result, achieving an accuracy of 95.80% with the Euclidean distance measure. Haydar et al. [11] proposed a Deep Learning model to extract sentiment from Bangla text. The extraction process was carried out using an RNN. This proposed process achieved an accuracy of 80% while mining sentiments from Facebook posts. A Deep Learning based text classification approach was proposed by Moqsadur et al. [12], where Bengali sports news comments were categorized based on sentiment. CNN, RNN, and Multilayer Perceptron (MLP) models were used for the categorization task. Performance was measured with respect to F1 score, precision, and recall, and in every criterion CNN outperformed the others. Kabir et al. [13] applied a Stochastic Gradient Descent (SGD) classifier for Bangla document categorization, where feature mining included TF-IDF. The experiment was conducted on BDNews24 documents. The performance was measured using the F1 score, which indicated that the proposed SGD classifier achieved a score of 0.9385, higher than the other investigated methods.
III. METHODOLOGY
Most text classification tasks nowadays are based on word tokenization [14]. In this paper, we use tokenization at the character level. Fig. 1 shows the overall outline of our methodology. For each document, we first tokenize the document into characters. After tokenization, we convert the characters to sequence ids. All of the documents are then sent to the embedding layer to get a 32-dimensional embedding vector for each character via semantic analysis. On the other hand, each document class is converted to a one-hot encoding. After that, the character embedding matrix and the one-hot encoding matrix are sent to the CNN or RNN model for text classification.
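The tokenization and encoding steps above can be sketched as follows. This is a minimal illustration, not the paper's exact code: the character set, vocabulary indices, class names, and padding convention are all assumptions, with Latin characters standing in for Bangla ones.

```python
import numpy as np

def build_vocab(texts):
    """Map every distinct character to an integer id (0 is reserved for padding)."""
    chars = sorted({ch for t in texts for ch in t})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def texts_to_ids(texts, vocab, max_len):
    """Tokenize each document by characters and convert to padded id sequences."""
    out = np.zeros((len(texts), max_len), dtype=np.int64)
    for row, t in enumerate(texts):
        for col, ch in enumerate(t[:max_len]):
            out[row, col] = vocab.get(ch, 0)  # unknown characters map to the padding id
    return out

def one_hot(labels, classes):
    """Convert class labels to one-hot vectors, as done for the document categories."""
    out = np.zeros((len(labels), len(classes)))
    for i, lab in enumerate(labels):
        out[i, classes.index(lab)] = 1.0
    return out

# Toy documents; the id matrix would feed the embedding layer.
docs = ["abc", "cab"]
vocab = build_vocab(docs)                     # {'a': 1, 'b': 2, 'c': 3}
ids = texts_to_ids(docs, vocab, max_len=4)
labels = one_hot(["sports", "news"], ["news", "sports"])
```

The resulting id sequences are what the embedding layer maps to 32-dimensional vectors, while the one-hot matrix serves as the training target.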
Fig. 1: Character level Deep Learning for document classifica-
tion (Systematic approach including character level encoding
and model building using Deep Learning algorithms).
A. Convolutional Neural Network (CNN)
A Convolutional Neural Network [15] consists of three kinds of layers: convolution, pooling, and fully connected (FC) layers.

Fig. 2: Character level CNN (Layered architecture of CNN consisting of embedding layer, convolutional layers and max-pooling layers).

The convolution layer uses filters to detect and extract features. The pooling layer is applied to every feature map and reduces its spatial size; by downsampling in this way, it helps to avoid heavy computation. The size of a feature map is reduced according to formula (1):

os = (is − ps + 1) / s    (1)

where os, is, ps, and s are the output shape, input shape, pooling size, and stride value, respectively. A fully connected (FC) layer is used for the classification part. It mixes the signals of information between each input dimension and each output class so that the classification is based on the whole input.
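As a quick check of formula (1), the pooled output shape can be computed directly; the input length below is an arbitrary example, not a value from the paper.

```python
def pooled_shape(input_shape, pool_size, stride):
    """Output shape of a pooling layer, following formula (1): os = (is - ps + 1) / s."""
    return (input_shape - pool_size + 1) // stride

# A feature map of length 100 through a size-3 pooling layer as used in our
# architecture, once with stride 1 and once with stride 3:
print(pooled_shape(100, 3, 1))  # -> 98
print(pooled_shape(10, 3, 3))   # -> 2
```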
TABLE I: CONVOLUTION AND POOLING LAYERS USED IN OUR ARCHITECTURE.

Layer  Features  Kernel  Pooling  Stride
1      64        7       3        1
2      64        7       3        1
3      64        3       N/A      1
4      64        3       3        1
TABLE I and Fig. 2 show the details of the different layers used in our model. Four convolution layers are used: the first two have kernels of size 7 and the last two of size 3. There is a pooling layer of size 3 after each convolution layer except the third one. Each of the convolution layers uses a stride of size 1.
In the fully connected part of the network, there are two fully connected (dense) layers of 256 neurons each. After each fully connected layer, a dropout layer with a rate of 0.5 is added to control overfitting. Finally, an output (dense) layer with a softmax activation function is used to produce the output.
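A minimal Keras sketch of this layer stack could look as follows. The paper does not publish its code, so this is an assumption-laden reconstruction: the vocabulary size, sequence length, number of classes, activation choices, and pooling strides are illustrative, not the authors' exact settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN, NUM_CLASSES = 100, 1024, 5  # assumed values for illustration

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, 32),                # 32-dim character embedding
    layers.Conv1D(64, 7, strides=1, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(64, 7, strides=1, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(64, 3, strides=1, activation="relu"),  # no pooling after layer 3
    layers.Conv1D(64, 3, strides=1, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```

The dropout rate of 0.5 and the two 256-neuron dense layers follow the description above; everything marked as assumed would need to be set from the actual data.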
B. Long Short-Term Memory (LSTM)
Recurrent Neural Networks [16] are designed for sequential data, but plain RNNs struggle with long-distance dependencies. In our method, we used Long Short-Term Memory (LSTM), a special type of RNN that handles such dependencies efficiently [17]. There are several kinds of gates in an LSTM cell. Each gate processes the incoming data differently and updates the cell memory.
Fig. 3 illustrates the basic architecture of an LSTM cell. Here, h_{t-1} is the output of the previous LSTM cell and x_t is the new input, which is concatenated with h_{t-1}.
At first, x_t and h_{t-1} are sent to every gate. The output obtained from each gate can be expressed by formulas (2), (3), (4), and (5) [18]:
Fig. 3: LSTM cell (architecture of an LSTM cell consisting of three gates, namely the input, forget, and output gates, and activation functions).
g = tanh(x_t W_x^g + h_{t-1} W_h^g + b_g)    (2)

i = sigmoid(x_t W_x^i + h_{t-1} W_h^i + b_i)    (3)

f = sigmoid(x_t W_x^f + h_{t-1} W_h^f + b_f)    (4)

o = sigmoid(x_t W_x^o + h_{t-1} W_h^o + b_o)    (5)
where W_x is the weight for the input, W_h is the weight for the previous cell output, and b is the bias. Finally, the output of the LSTM cell is calculated by formulas (6) and (7):

s_t = g ∗ i + s_{t-1} ∗ f    (6)

h_t = tanh(s_t) ∗ o    (7)

Here, ∗ denotes element-wise multiplication.
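A single LSTM cell step following formulas (2)–(7) can be sketched in NumPy; the dimensions and random weights below are toy values for illustration, not the 128/64-neuron configuration used in our model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W_x, W_h, b):
    """One LSTM cell step. W_x, W_h, b hold the parameters of the
    g, i, f, o gates, in that order."""
    g = np.tanh(x_t @ W_x[0] + h_prev @ W_h[0] + b[0])   # (2) candidate values
    i = sigmoid(x_t @ W_x[1] + h_prev @ W_h[1] + b[1])   # (3) input gate
    f = sigmoid(x_t @ W_x[2] + h_prev @ W_h[2] + b[2])   # (4) forget gate
    o = sigmoid(x_t @ W_x[3] + h_prev @ W_h[3] + b[3])   # (5) output gate
    s_t = g * i + s_prev * f                             # (6) new cell state
    h_t = np.tanh(s_t) * o                               # (7) cell output
    return h_t, s_t

in_dim, hid = 4, 3                        # toy sizes
W_x = rng.normal(size=(4, in_dim, hid))   # one weight matrix per gate
W_h = rng.normal(size=(4, hid, hid))
b = np.zeros((4, hid))
h, s = lstm_step(rng.normal(size=in_dim), np.zeros(hid), np.zeros(hid), W_x, W_h, b)
```

Since the output is tanh(s_t) scaled by a sigmoid gate, every component of h stays strictly inside (-1, 1).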
In our method, two LSTM layers are used sequentially (Fig. 4). The first LSTM uses 128 neurons and the second uses 64. After each LSTM layer, a dropout layer with a rate of 0.5 is added to reduce overfitting. A fully connected layer is placed after the LSTM layers. Finally, a softmax function is used to produce the output.
Fig. 4: Character level LSTM (Layered architecture including embedding layer, LSTM cells, and activation functions).
IV. RESULTS AND ANALYSIS
A. Dataset
In our experiment, we have chosen three different datasets to train and validate our models: BARD [19], Prothom Alo [20], and the Open Source Bengali Corpus (OSBC) [21]. Among the three, Prothom Alo was collected by web scraping, and the other two are open-source datasets. These data repositories contain different numbers of Bangla documents. Some documents have been removed from these datasets because they contained characters that do not belong to the Bangla language. To measure the performance efficiently, we have used three different sizes of data from these datasets. The overall dataset and its splitting are presented in TABLE II. Furthermore, we have partitioned each selected dataset in a ratio of 80 to 20 for training and testing purposes.
TABLE II: OVERALL DATASET SPLITTING.

Name           BARD    Prothom Alo  OSBC
Training data  40448   103008       63036
Testing data   10112   25753        15760
Total data     50560   128761       78796
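The 80/20 split in TABLE II can be reproduced with a few lines of Python; the totals are taken directly from the table, and the integer-arithmetic rounding is an assumption about how the counts were derived.

```python
totals = {"BARD": 50560, "Prothom Alo": 128761, "OSBC": 78796}

def split_80_20(total):
    """80% training / 20% testing split, using integer arithmetic."""
    train = total * 4 // 5
    return train, total - train

for name, total in totals.items():
    train, test = split_80_20(name and total)
    print(f"{name}: train={train}, test={test}")
```

Running this yields exactly the training and testing counts shown in TABLE II.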
B. Experiments
We trained our models using Nvidia Geforce 2080 Super
GPU. During training, we used batch size 8 and 100 epochs for
every experiment. We used cross-entropy as the loss function
and adam as the optimizer with initial learning rate 0.001 to
train the models.
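For reference, the cross-entropy loss used here reduces, for a one-hot label, to the negative log-probability the model assigns to the true class. A minimal sketch with made-up probabilities:

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Categorical cross-entropy: -sum(y_true * log(y_pred))."""
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])   # one-hot label for class 1
y_pred = np.array([0.1, 0.8, 0.1])   # example softmax output of a model
loss = cross_entropy(y_true, y_pred)  # equals -log(0.8), about 0.223
```

This is the per-document quantity that the Adam optimizer minimizes, averaged over each batch of 8.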
C. Results
Accuracy and F1 score are computed to analyze the performance of the working models. They are calculated using the following formulas (8), (9), (10), and (11) [22]:

Precision = TP / (TP + FP)    (8)

Recall = TP / (TP + FN)    (9)

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (10)

F1 = (2 ∗ Precision ∗ Recall) / (Precision + Recall)    (11)

where TP, FP, TN, and FN are the true positives, false positives, true negatives, and false negatives, respectively.

Fig. 5: Accuracy and loss of CNN over epochs (a. accuracy; b. loss).
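Formulas (8)–(11) translate directly into code; the confusion counts below are made-up numbers for illustration, not values from our experiments.

```python
def metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts,
    following formulas (8)-(11)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Example confusion counts (illustrative only):
p, r, a, f1 = metrics(tp=80, fp=20, tn=90, fn=10)
print(p, a)  # 0.8 0.85
```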
Fig. 5 and Fig. 6 depict the validation accuracy and loss of the CNN and LSTM models for all the working datasets over the epochs, while TABLE III shows the models' final accuracy and F1 scores.
D. Discussions and Findings
From Fig. 5, it is observed that CNN performs well when classifying the Prothom Alo dataset.

TABLE III: EXPERIMENTAL RESULT COMPARISON OF CNN AND LSTM.

        BARD               Prothom Alo        OSBC
Models  Accuracy  F1       Accuracy  F1       Accuracy  F1
CNN     92.85%    91.69%   94.44%    91.08%   78.10%    71.59%
LSTM    92.41%    90.76%   95.42%    92.57%   82.06%    78.07%

Fig. 6: Accuracy and loss of LSTM over epochs (a. accuracy; b. loss).

The observed accuracy of CNN on the Prothom Alo dataset is higher than on the other datasets, and its validation loss on Prothom Alo is comparatively lower. Fig. 6 demonstrates the validation accuracy and loss of the LSTM models. In terms of accuracy and loss, it is also observed that, like CNN, LSTM performs well on the Prothom Alo dataset.
V. CONCLUSION
In this paper, a character-level approach is implemented to categorize Bangla text documents. Two well-known Deep Learning models, namely CNN and LSTM, are used for the classification task. Here, 80% of the data from each dataset is used for training the models and the remaining 20% is used for testing. The presented character-level encoding scheme improves the accuracy of the classification task. This scheme first tokenizes the documents at the character level and generates a sequence of character identities, on which character embedding is applied. The overall performance of the LSTM model is satisfactory on all three datasets. The highest accuracy and F1 score, 95.42% and 92.57%, are obtained on the Prothom Alo dataset using the LSTM model. In future studies, we will try to build more advanced and robust hybrid models by integrating different Deep Learning architectures to classify documents. We will also implement Bi-directional LSTM and BERT models for categorizing Bangla text documents.
REFERENCES
[1] ”A language of Bangladesh”, Last Accessed: July 29, 2020. [Online].
Available: https://www.ethnologue.com/language/ben
[2] A. Bilski, “A review of artificial intelligence algorithms in document
classification,” International Journal of Electronics and Telecommuni-
cations, vol. 57, pp. 263–270, 2011.
[3] Y. Juhn and H. Liu, “Artificial intelligence approaches using natural
language processing to advance ehr-based clinical research,” Journal of
Allergy and Clinical Immunology, vol. 145, no. 2, pp. 463–469, 2020.
[4] D. Shimada, R. Kotani, and H. Iyatomi, “Document classification
through image-based character embedding and wildcard training,” in
2016 IEEE International Conference on Big Data (Big Data). IEEE,
2016, pp. 3922–3927.
[5] B. Adams and G. McKenzie, “Crowdsourcing the character of a place:
Character-level convolutional networks for multilingual geographic text
classification,” Transactions in GIS, vol. 22, no. 2, pp. 394–408, 2018.
[6] T. Wang, L. Liu, H. Zhang, L. Zhang, and X. Chen, “Joint character-level
convolutional and generative adversarial networks for text classification,”
Complexity, vol. 2020, 2020.
[7] A. N. Samatin Njikam and H. Zhao, “Chartec-net: An efficient and
lightweight character-based convolutional network for text classifica-
tion,” Journal of Electrical and Computer Engineering, vol. 2020, 2020.
[8] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional
networks for text classification,” in Advances in neural information
processing systems, 2015, pp. 649–657.
[9] A. Dhar, N. Dash, and K. Roy, “Classification of text documents through
distance measurement: An experiment with multi-domain bangla text
documents,” in 2017 3rd International Conference on Advances in
Computing, Communication & Automation (ICACCA)(Fall). IEEE,
2017, pp. 1–6.
[10] A. Dhar, N. S. Dash, and K. Roy, “Application of tf-idf feature for
categorizing documents of online bangla web text corpus,” in Intelligent
Engineering Informatics. Springer, 2018, pp. 51–59.
[11] M. S. Haydar, M. Al Helal, and S. A. Hossain, “Sentiment extraction
from bangla text: A character level supervised recurrent neural network
approach,” in 2018 International Conference on Computer, Communica-
tion, Chemical, Material and Electronic Engineering (IC4ME2). IEEE,
2018, pp. 1–4.
[12] M. Rahman, S. Haque, and Z. R. Saurav, “Identifying and categorizing
opinions expressed in bangla sentences using deep learning technique,”
International Journal of Computer Applications, vol. 975, p. 8887.
[13] F. Kabir, S. Siddique, M. R. A. Kotwal, and M. N. Huda, “Bangla
text document categorization using stochastic gradient descent (sgd)
classifier,” in 2015 International Conference on Cognitive Computing
and Information Processing (CCIP). IEEE, 2015, pp. 1–4.
[14] J. L. Fagan, M. D. Gunther, P. D. Over, G. Passon, C. C. Tsao,
A. Zamora, and E. M. Zamora, “Method for language-independent text
tokenization using a character categorization.” Google Patents, Feb. 5
1991, uS Patent 4,991,094.
[15] Y. Li, Z. Hao, and H. Lei, “Survey of convolutional neural network,”
Journal of Computer Applications, vol. 36, no. 9, pp. 2508–2515, 2016.
[16] G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of recurrent-
neural-network architectures and learning methods for spoken language
understanding,” in Interspeech, 2013, pp. 3771–3775.
[17] A. Sherstinsky, “Fundamentals of recurrent neural network (rnn) and
long short-term memory (lstm) network,” Physica D: Nonlinear Phe-
nomena, vol. 404, p. 132306, 2020.
[18] S. Ghosh, O. Vinyals, B. Strope, S. Roy, T. Dean, and L. Heck,
“Contextual lstm (clstm) models for large scale nlp tasks,” arXiv preprint
arXiv:1602.06291, 2016.
[19] M. T. Alam and M. M. Islam, “Bard: Bangla article classification using
a new comprehensive dataset,” in 2018 International Conference on
Bangla Speech and Language Processing (ICBSLP). IEEE, 2018, pp.
1–5.
[20] ”Prothom Alo”, Last Accessed: August 14, 2020. [Online]. Available:
https://www.prothomalo.com/
[21] ”Bangla Dataset (Corpus)”, Last Accessed: August 14, 2020. [Online].
Available: https://scdnlab.com/corpus/
[22] N. D. Marom, L. Rokach, and A. Shmilovici, “Using the confusion
matrix for improving ensemble classifiers,” in 2010 IEEE 26-th Con-
vention of Electrical and Electronics Engineers in Israel. IEEE, 2010,
pp. 000555–000559.