A Review of the State of the Art of Text Classification Algorithms
A.Bhavani
Assistant Professor/Computer Science & Engineering
GMR Institute of Technology
Rajam, Andhra Pradesh, India
bhavaniashapu@gmail.com
Dr. B. Santhosh Kumar, SMIEEE
Professor/ Computer Science & Engineering,
Guru Nanak Institute of Technology,
Hyderabad, Telangana, India
drbsanthoshkumar@ieee.org
Abstract— In recent years, the categorization of text documents into predefined classes has received growing interest due to the increasing volume of documents in digital form and the need to organize them. Text categorization is one of the most widely used natural language processing (NLP) applications and is typically achieved with machine learning algorithms. Finding the most suitable representation and technique for text classification remains a challenge for researchers. The classification process can be performed manually or automatically. This paper reviews preprocessing, feature extraction, and the different algorithms and techniques used for text classification, and finally discusses the performance metrics used for assessment.
Keywords— Text Classification; Preprocessing; Feature Extraction; Classification Algorithms; Evaluation Measures; Natural Language Processing; Deep Learning Models
I. INTRODUCTION
Text classification is the technique of categorizing text into organized classes according to its content. It is also called text tagging or text categorization. It is one of the significant tasks in natural language processing (NLP), with extensive applications including intent detection, topic labeling, spam detection, and sentiment analysis. Using NLP, text classifiers can automatically analyze text and then assign a pre-defined collection of tags or categories according to its topic, whether the source is medical studies, files, documents, or content from across the web. Before the classifier can assign a category to a piece of text, the similarity of all inputs in the training set must be determined; consequently, classification performance assessment tends to degrade as the number of training inputs rises [1]. The automated arrangement of documents into classes has become a central task for knowledge discovery and information organization as the availability of electronic documents and the growth of the internet have increased [2].
Natural language processing (NLP), machine learning techniques, and data mining are used to identify and discover patterns in electronic documents automatically. The main purpose of text mining is to enable users to extract information from textual resources and to handle the related operations. The aim of Information Extraction (IE) methods is to extract precise data from text documents; this is the first approach, which suggests that text mining is essentially the same as data extraction. The process of finding documents that provide answers to questions is known as information retrieval (IR). To achieve this, statistical measurements and methods for automated text data processing with respect to a given query are used. In its broadest sense, information retrieval encompasses the entire spectrum of information processing, from data retrieval to knowledge retrieval. Text classification follows a sequence of processing steps, namely pre-processing, feature extraction, and classification algorithm models for training and prediction, as shown in Fig. 1. The training data set is used to train the model, and prediction is used to find the labelled output after the text document has been classified with different machine learning algorithms acting as classifier models.
Fig.1. Text classification Process
A. Pre-processing
Pre-processing is the important task for classification of text in
natural language processing. There are three main components
for processing. Those are Tokenization, Normalization and
Noise removal.
Tokenization: Tokenization involves strings that divided into
smaller tokens. It is possible to tokenize paragraphs into
sentences and tokenize sentences into words.
Normalization: Normalization aims to eliminate common and highly frequent words such as "the" and "a" and to remove other irrelevant words.
Noise removal: Noise removal removes unwanted text and converts different forms of words into a common canonical form [3].
Fig.2. Text classification Preprocessing
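To make these three steps concrete, the following minimal Python sketch applies noise removal, tokenization, and normalization to a short string. The sample sentence and the small stop-word list are assumptions made only for this example; they are not taken from the reviewed works.

```python
# Illustrative pre-processing sketch (standard library only): tokenization,
# normalization, and noise removal. The sample sentence and the small
# stop-word list are assumptions made for the example.
import re

STOP_WORDS = {"the", "a", "an", "and", "is", "are", "of", "to", "in"}

def preprocess(text):
    # Noise removal: strip punctuation/digits and lower-case the text.
    text = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    # Tokenization: split the cleaned string into word tokens.
    tokens = text.split()
    # Normalization: drop common and frequent words such as "the" and "a".
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The classifiers are classifying 100 text documents!"))
# -> ['classifiers', 'classifying', 'text', 'documents']
```

In practice a stemmer or lemmatizer would also be applied at the normalization step so that different forms of a word map to one canonical form.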
B. Feature extraction
The important task of text classification is feature extraction.
It gives not only more accuracy and gives greater probability,
saves computational time. Feature extractions have different
characteristics are word bag and embedding and extract the
different characteristics from text [4]. The features converted
into different vectors as shown in fig. 1. The vectors are
counted in the form of TF-IDF Vectors.TF-IDF is a Term
Frequency and Inverse Document Frequency for two-word
vectors [5]. For TF, the frequency of occurrence in a text of a
word t, d is determined and assigned to the same word as
weight. The value of word in the text document calculated by
the IDF. By using the following equation, the IDF of word t
was determined
IDFt= (1)
The number of documents shows where N is. Now, using the
following equation, the TF-IDF formula for word t are
estimated in document d
TF-IDF = TF *IDFt (2)
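As a concrete illustration of Equations (1) and (2), the following short Python sketch computes TF-IDF weights over a small toy corpus. The three example documents and the use of the natural logarithm are assumptions made purely for illustration.

```python
# Minimal TF-IDF sketch following Equations (1) and (2).
# The toy corpus below is only an illustrative assumption.
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)                                   # number of documents

# df[t]: number of documents containing term t
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)    # term frequency TF_{t,d}
    idf = math.log(N / df[term])                     # Equation (1)
    return tf * idf                                  # Equation (2)

print(round(tf_idf("cat", tokenized[0]), 4))   # rare word: relatively high weight
print(round(tf_idf("the", tokenized[0]), 4))   # common word: low IDF, low weight
```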
After preprocessing and feature extraction, the next step is to classify the text documents using machine learning classification techniques. Based on these algorithms, documents are assigned single-class or multi-class labels. Finally, performance evaluation measures are used to compute metrics such as accuracy and recall. Based on these measures, one can identify which classification algorithm performs well and takes less computation time when classifying documents.
II. LITERATURE SURVEY
Compared with the traditional text classification algorithm, the KNN algorithm reduces the time complexity and significantly increases the efficacy of text classification, although its accuracy is lower. Compared with other machine learning algorithms, Naive Bayes performs well on textual and numeric data, but its conditional independence assumption is violated by real-life data, it ignores the frequency of word occurrences, and it performs poorly when features are strongly correlated. Hence, this algorithm is mainly used for email classification and spam filtering.
Among machine learning algorithms for text classification, SVM classification is the most effective compared with the other supervised machine learning algorithms. SVM better captures the inherent characteristics of the data. Moreover, its ability to learn can be independent of the dimensionality of the feature space, and it finds a global rather than a local minimum. However, SVM has some difficulties when it comes to kernel selection and parameter tuning. Compared with other algorithms, KNN and Naïve Bayes are well suited to pre-processing. KNN continues to achieve good results as the number of documents increases, which is not the case for SVM.
In KNN it is difficult to find the optimal value of K, and classification takes more time. In addition, more work is required to improve the performance and document classification accuracy. With the growing number of electronic documents, new algorithms and solutions are needed to extract useful knowledge. In [6], the rough set method is combined with the KNN text classification algorithm to form an improved KNN text classification algorithm, in which similarity is computed using distance measures to improve classification performance [6].
Another paper presents the importance of different text classification techniques and the challenges that these algorithms address. It reviews different feature selection techniques for achieving better accuracy, and shows how semantic and ontology-based methods are used for information retrieval and document classification. The paper also explains that the performance of a classifier depends on the training text, and that high-quality training data tends to produce classifiers with good evaluation performance [7].
Jing Ouyang reports that SVM-based text classification improves when the data and device parameters are tuned in specific experiments, so the classification algorithm significantly improves the efficiency and accuracy of the classifier for text classification [8].
The authors explain that different applications can be categorized into topic classification, which classifies text based on topic, relevance, or aspect; sentiment analysis, which detects the sentiment of text and returns a positive or negative result; and intent classification, which classifies text based on intent, for example feedback, request, or complaint [9].
The following parameters are used for measuring the performance of classifiers: true positives (TP) as correctly identified, false positives (FP) as incorrectly identified, true negatives (TN) as correctly rejected, and false negatives (FN) as incorrectly rejected [3]. Precision, recall, F-measure, and micro and macro averages are used as the evaluation indices for text classification [10].
These parameters are applied to a medical data set in [11], where the algorithm uses different modules to achieve higher classification accuracy. The feature learning module contains a word embedding layer, positional encoding, an attention mechanism, and a gradient reversal layer; a word extraction module is based on an adversarial network; and text classification is performed using dual-channel subtraction [11].
In [13], two distance measure techniques are used to obtain better accuracy for Bangla text. Different classification algorithms are applied to input text divided into domains such as business, state, science, sports, and medicine. The classification techniques compared are Euclidean distance, cosine similarity, J48, Naïve Bayes, multinomial Naïve Bayes, random forest, and simple logistic regression. Euclidean distance and cosine similarity generate the best accuracies, 95.20% and 95.80% respectively [13]. Another paper uses TF, IDF, and CHI statistics for the classification of legal provisions and characteristic words. The counts of documents that do and do not belong to a class are denoted A, B, C, and D for computing the measures. English legal texts from domains such as society, military, sports, finance, IT, property, education, and entertainment are used, and the improved TF-IDF algorithm gives better performance than the traditional TF-IDF algorithm [14].
For patent text classification, multivariate neural network fusion is used [15]. The model parameters and their values are: word embedding dimension 300; convolution kernel sizes {3, 4, 5}; number of BiGRU nodes 128; dropout 0.5; batch size 64; epoch size 100; optimizer rmsprop; loss categorical cross-entropy [15].
Fig.3. Comparison of accuracy for ML techniques (Naive Bayes, KNN, Random Forest, SVM, CNN, RNN) on Chinese text, short text, English text, and news text; accuracy shown on a 0-100% scale.
Clustering, an unsupervised learning approach, is used in [12] for text classification and for removing noisy data; compared with Bayes-based incremental learning algorithms, the clustering algorithm improves both the efficiency and the accuracy of incremental learning [12]. For English text, one author compares two SVM variants, a standard SVM and an improved SVM. The improved SVM has the following characteristics: good fault tolerance, information filtering based on user requirements, stable running, very few system bugs, good system speed, and less processing time.
In text classification, pre-processing is the first and primary phase. Pre-processing of the dataset consists of several steps: plain text is taken as input data, the data set is partitioned into different bags of words, and the data are vectorized as different n-grams and divided into a training
set and a test set, after which the training dataset is built for multiple algorithms. Finally, the test set is predicted based on the result of the training set and the accuracy is calculated for all algorithms. In this paper, TextCNN, TextRNN, and TextRNN + Attention methods are used for text classification. Text comes in different forms with a specific sequence, and this context maintains the syntactic and semantic dependencies. The implemented algorithms are the deep learning algorithms convolutional neural network (CNN) and recurrent neural network (RNN).
The RNN technique can handle semantic dependencies to achieve meaningful classification, but it cannot be parallelized well, so it is more suitable for short text; for long text it takes more training time when processing the text. The CNN can be parallelized well for classification and is suitable for long text and for filters of different sizes. It uses multiple channels, and the most influential features can be selected using max pooling. For feature selection, it applies dropout at the fully connected layer to obtain the classification result. These algorithms achieve accuracies of 85% for CNN and 82% for RNN. Text classification can also be performed with convolutional neural networks based on text detection and image detection [16].
Text classification has become more advanced and requires six key components: data collection, data processing and labelling, extraction and weighting of features, selection or projection of features, model training, and evaluation of solutions. For fast text data, the algorithms SVM, KNN, and Naïve Bayes generated accuracies of 91%, 88%, and 86%, respectively [17].
In feature extraction, the word tags are divided into unigrams and bigrams to make it easier to determine the accuracy and to speed up the classification of text documents using TF-IDF. On British English and American English text, the SVM algorithm achieves 92.1% accuracy in the stemming test compared with normal text. Unigram text gives an accuracy of 91.2% compared with bigrams and N-grams; using a linear kernel, an accuracy of 87.1% is obtained compared with other kernels; and, taking threshold values from 1 to 10, the highest accuracy of 94.0% is obtained at a threshold value of 2 [18].
Hadoop MapReduce faces many limitations in the field of text classification with K-Means. The work in [19] proposes Hadoop MapReduce with a Naïve Bayes classifier to reduce these problems. This classifier is much better in terms of time and space complexity, and it is simple and easy to use. On the medical data set, accuracies of 47.85% for Hadoop MapReduce, 72.13% for Gaussian Naïve Bayes, 65.5% for Bernoulli Naïve Bayes, and 52.45% for multinomial Naïve Bayes were generated [19].
In [20], it is observed that the traditional TF-IDF algorithm cannot capture the distribution of feature words across different categories. The author therefore proposes an improved TF-IDF algorithm implemented with a weighting factor that reflects the degree of inter-class and intra-class distribution and the association between classes and feature words. The paper uses the Word2vec model to represent text words as word vectors instead of a vector space model, combining the word vector with the word weight. Micro and macro averages are used for performance evaluation; the micro average gives higher accuracy than the macro average for both the traditional and the improved TF-IDF. The improved TF-IDF gives higher classification accuracy than the traditional TF-IDF and produces results more effectively [20].
The works discussed above cover different machine learning and deep learning algorithms. These algorithms are applied to Chinese text, short text, English text, and news text, and the accuracies obtained in those domains are shown in Fig. 3. Nowadays, the most up-to-date text classification algorithms are deep learning techniques and their sub-domains, which give higher accuracy and take less computation time. These algorithms are appropriate for large amounts of data with multi-class text classification.
III. TEXT CLASSIFICATION ALGORITHMS
Machine learning algorithms used to classify documents can be divided into unsupervised, supervised, and semi-supervised approaches. Many techniques and algorithms for clustering and classification of electronic documents have recently been proposed. Drawing on the current literature, this section focuses on supervised classification methods, emerging technologies, and some of the opportunities and challenges. As internet use has rapidly increased, the automated sorting of documents into predefined categories has attracted active attention. The role of automatic text classification has been extensively studied in recent years, and rapid progress appears to have been made in this field, including machine learning approaches such as K-Nearest Neighbors (KNN), Naïve Bayes, Support Vector Machine (SVM), Decision Tree, Neural Networks, Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN). Automatic text classification typically employs supervised learning techniques, in which pre-defined category labels are assigned to documents based on the probabilities indicated by a training collection of labelled documents. Some of these methods are described below.
Fig.4. Machine Learning Algorithms
A. Linear Regression
Linear regression is the most basic method in machine learning. It is used to estimate real values based on independent variables: a linear line is fitted to express the relationship between the dependent variable and the independent variables. This best-fit line is called the regression line and is represented by the equation y = mx + c, where y is the dependent variable, x is the independent variable, m is the slope, and c is the intercept constant. The coefficients m and c are calculated by minimizing the sum of squared differences between the regression line and the data points. Linear regression falls into two categories: simple linear regression, which uses one independent variable, and multiple linear regression, which uses more than one independent variable. Curvilinear or polynomial regression can be used when a straight regression line does not fit the data well.
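A minimal sketch of fitting the regression line is shown below; the toy data points are an illustrative assumption. NumPy's least-squares polynomial fit estimates m and c by minimizing the sum of squared differences described above.

```python
# Simple linear regression sketch: estimate m and c in y = m*x + c
# by least squares. The toy data points are an illustrative assumption.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # independent variable
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])        # dependent variable

# np.polyfit minimizes the sum of squared differences between the
# regression line and the data points (degree 1 -> straight line).
m, c = np.polyfit(x, y, 1)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
print("prediction for x = 6:", m * 6 + c)
```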
B. K-Nearest Neighbors (K-NN)
KNN is a machine learning algorithm for both regression and classification problems. The KNN algorithm is a simple method that stores all available cases and classifies new cases based on a majority vote of their k nearest neighbours. A case is assigned to the class that is most common among its K closest neighbours, as determined by a distance function. These distance functions include Minkowski, Euclidean, Manhattan, and Hamming distance; the first three are used for continuous variables and Hamming distance for categorical variables. KNN is a non-parametric classification technique for text classification. First, the text to be classified is identified so that similarity can be computed; next, KNN identifies the neighbours using the K value for text classification. Choosing the K value is important because it affects the accuracy of text classification [1]. Since KNN is computationally costly, variables should be normalised to avoid bias, and more time should be spent at the pre-processing level before using KNN for tasks such as outlier detection and noise reduction.
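A minimal sketch of KNN-based text classification with scikit-learn is given below. The tiny labelled corpus, the choice of K = 3, and the cosine distance metric are illustrative assumptions, not settings prescribed by the reviewed works.

```python
# KNN text classification sketch with scikit-learn.
# The tiny labelled corpus and K value are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_texts = [
    "cheap pills buy now", "limited offer click here",        # spam
    "meeting agenda for monday", "please review the report",  # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

# TF-IDF vectors feed a K-nearest-neighbours classifier: the K = 3 closest
# training documents (by cosine distance) vote on the class of a new document.
knn = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
knn.fit(train_texts, train_labels)
print(knn.predict(["click now for a cheap offer"]))   # likely ['spam']
```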
C. Naïve Bayes
Naïve Bayes is a classification method based on Bayes' theorem and the assumption of predictor independence. In simple terms, a Naïve Bayes classifier assumes that the presence of one feature in a class is unrelated to the presence of any other feature. The Naïve Bayes algorithm is used for text classification because of its effectiveness and simplicity. Compared with simple Naïve Bayes, multinomial Naïve Bayes has considerable advantages and gives higher accuracy. Naïve Bayes quantifies the characteristics using the conditional probability of two events happening, depending on the probability of each actual occurrence; therefore, for a given text, the probability of each word tag is calculated. Bayes' theorem provides the calculation of the posterior probability P(c|x) from P(x), P(c), and P(x|c). Bayes' rule is given as

P(c|x) = P(x|c) * P(c) / P(x)    (3)

and, under the independence assumption,

P(c|x) ∝ P(x1|c) * P(x2|c) * ... * P(xn|c) * P(c)

where
P(c|x) is the posterior probability of the class (target) given the predictor (attribute),
P(c) is the prior probability of the class,
P(x|c) is the likelihood, i.e., the probability of the predictor given the class, and
P(x) is the prior probability of the predictor.
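The following sketch shows multinomial Naïve Bayes text classification with scikit-learn, where word counts serve as the predictors and the classifier applies Bayes' rule under the independence assumption of Equation (3). The small training corpus is assumed only for illustration.

```python
# Multinomial Naïve Bayes sketch for text classification (scikit-learn).
# The small training corpus is an illustrative assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize today", "free money click now",          # spam
    "project meeting at noon", "lunch with the team tomorrow", # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Word counts are the features; MultinomialNB applies Bayes' rule with the
# naive independence assumption of Equation (3) over the word counts.
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(texts, labels)

print(nb.predict(["free prize meeting"]))          # predicted class
print(nb.predict_proba(["free prize meeting"]))    # posterior P(c|x) per class
```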
D. Support Vector Machine
SVM is one of the best supervised machine learning algorithms, used for both regression and classification but mainly applied to classification problems. Every data item is placed in a high-dimensional space, and hyperplanes are then identified based on the features; these hyperplanes are used to separate the classes. SVM gives good precision but lower recall for text classification.
A customized SVM improves recall by using different threshold values and gives the best results for both precision and recall. Used in different ways, SVM-based classification improves efficiency and is faster at runtime. For high-dimensional data, SVM gives higher accuracy because it controls model complexity and overfitting issues. SVM is one of the most important machine learning algorithms for text classification and relies on a linear transformation. It requires more resources for classification compared with Naïve Bayes, but it gives higher accuracy and faster predictions. SVM uses a hyperplane to separate the tags (vectors) built from the input features.
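A minimal sketch of SVM text classification with scikit-learn follows; the toy corpus and the regularization value C = 1.0 are illustrative assumptions.

```python
# Linear SVM sketch for text classification (scikit-learn).
# The toy corpus and C value are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = [
    "the match ended in a draw", "the striker scored a late goal",    # sports
    "the court upheld the contract", "the judge dismissed the case",  # legal
]
labels = ["sports", "sports", "legal", "legal"]

# TF-IDF vectors place each document in a high-dimensional space;
# LinearSVC finds the separating hyperplane between the classes.
svm = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
svm.fit(texts, labels)
print(svm.predict(["the referee awarded a goal"]))   # likely ['sports']
```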
E. Decision Tree
The decision tree method is a type of supervised learning algorithm used for classification problems. This technique can be used for both continuous and categorical dependent variables. The population is split into two or more homogeneous groups based on the most significant independent variables/attributes so as to create groups that are as distinct as possible. The best feature for every node of the tree can be selected using an attribute selection measure; two common techniques are information gain and the Gini index.
The decision tree constructs document categorizations based on true or false outcomes: the leaves of the tree represent the document type, and the branches reflect the combinations of features that lead to those categories [2].
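A short illustrative sketch of a decision tree text classifier with scikit-learn is shown below; the toy corpus, the maximum depth, and the choice of the Gini index are assumptions made only for the example.

```python
# Decision tree sketch for text classification (scikit-learn).
# The toy corpus and tree depth are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "invoice attached please pay", "your payment is overdue",   # finance
    "new phone released this week", "laptop review and specs",  # tech
]
labels = ["finance", "finance", "tech", "tech"]

# criterion="gini" uses the Gini index as the attribute selection measure;
# criterion="entropy" would use information gain instead.
tree = make_pipeline(
    CountVectorizer(),
    DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0),
)
tree.fit(texts, labels)
print(tree.predict(["please pay the invoice"]))   # likely ['finance']
```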
F. Neural Networks
For text classification, neural networks have been able to achieve significant performance in sentence and document modelling. Two commonly used types of neural networks are the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN).
A CNN applies filters over windows of different sizes to extract features of the documents through its layers. These windows are used to classify the features generated from the word tag vectors. Features are extracted from the input-layer data using a kernel or filter, i.e., a convolution, which is computed as a dot product of vectors over the input data; the dot product is zero when the vectors are unrelated (orthogonal). The embedded word vectors are used for finding the similarity between vectors. The number of filters in the convolution depends on the user's requirement. After convolution, the filters generate multiple feature maps; these steps constitute the feature extraction of the input data across the layers. After the features are generated, an activation function is applied to the filter outputs, producing an output with a nonlinear relationship to the input. The convolution outputs are then dimensionally reduced to decrease the complexity and computational time. A CNN can be improved by adding more dense layers and a smaller number of wide layers.
An RNN is a sequence model designed to deal with variable-length document sequences and long-term dependencies. Both of these networks can use the long short-term memory (LSTM) network, which extracts higher-level sequences of tag features.
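A minimal CNN text classifier sketch in Keras is shown below. The vocabulary size, sequence length, layer sizes, and the randomly generated stand-in data are illustrative assumptions rather than settings taken from the reviewed papers.

```python
# Minimal CNN text classifier sketch with Keras (TensorFlow).
# Vocabulary size, sequence length and the dummy data are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, num_classes = 5000, 100, 2

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 128),         # word embeddings
    layers.Conv1D(64, 5, activation="relu"),   # convolution filters over word windows
    layers.GlobalMaxPooling1D(),               # max pooling keeps the strongest feature per filter
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                       # dropout before the final fully connected layer
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy integer-encoded documents stand in for real tokenized text.
x = np.random.randint(1, vocab_size, size=(32, seq_len))
y = np.random.randint(0, num_classes, size=(32,))
model.fit(x, y, epochs=1, batch_size=8, verbose=0)
print(model.predict(x[:1]).shape)   # (1, num_classes) class probabilities
```

Replacing the convolution and pooling layers with an LSTM layer would give the corresponding RNN variant described above.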
IV. PERFORMANCE EVALUATION MEASURES
The text classification evaluation measures are determined in terms of accuracy, recall, and F-measure; relying on a single measure such as recall or accuracy alone to test a classifier is not sufficient. The evaluation measures depend on the classification classes. Whether a document belongs to a class is described by four parameters: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Using these parameters, the evaluation measures are calculated as shown below:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Recall = TP / (TP + FN)    (5)

F-measure = 2 * Precision * Recall / (Precision + Recall), where Precision = TP / (TP + FP)    (6)

Multi-class text classification uses the micro average and the macro average as evaluation measures. The macro average is the mean of the F-measure values of all classes, while the micro average is computed from the counts aggregated over all classes, for example

Micro-averaged precision = Σi TPi / Σi (TPi + FPi)    (7)
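The following sketch, using scikit-learn and assumed example label vectors, computes the measures of Equations (4) to (7), including the micro and macro averages.

```python
# Evaluation-measure sketch following Equations (4)-(7) (scikit-learn).
# The example label vectors are illustrative assumptions.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

print("accuracy :", accuracy_score(y_true, y_pred))                      # Eq. (4)
print("recall   :", recall_score(y_true, y_pred, pos_label="spam"))      # Eq. (5)
print("precision:", precision_score(y_true, y_pred, pos_label="spam"))
print("F-measure:", f1_score(y_true, y_pred, pos_label="spam"))          # Eq. (6)

# Micro vs. macro averaging for multi-class problems:
# micro aggregates TP/FP/FN over all classes (Eq. 7),
# macro averages the per-class F-measure values.
print("micro F1 :", f1_score(y_true, y_pred, average="micro"))
print("macro F1 :", f1_score(y_true, y_pred, average="macro"))
```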
The efficiency of the classifier as measured by the above
parameters is as follows:
Table 1. Classification evaluation matrix

                                           Belongs to class             Does not belong to class
Classification belongs to group            TP / Correctly identified    FP / Incorrectly identified
Classification does not belong to group    FN / Incorrectly rejected    TN / Correctly rejected
V. CONCLUSION
Text plays a major role in the present era because much information is unstructured and comes in different types of data. Many issues accompany text data: it is difficult and time-consuming to understand, interpret, sort, and organize. Text classification algorithms are used to classify text quickly and efficiently in applications such as legal documents, emails, social media, surveys, chatbots, and more. This paper has reviewed text classification using different machine learning algorithms as classifier techniques. Comparing the different machine learning algorithms, SVM gives higher accuracy and takes less computation time. The decision tree is suited to hierarchical classification, while KNN gives higher accuracy when the K value is known and is more efficient when the distance measure is known. Finally, the deep learning algorithms CNN and RNN give higher accuracy than the other algorithms for large and multi-class text document collections; for multi-class classification, CNN gives the best results and is the most efficient. Evaluation measures such as accuracy and recall are used to identify the best classifier model with high accuracy and low time consumption.
References
[1] Aizhang, Guo, and Yang Tao. 2016. "Based on Rough Sets and the Associated Analysis of KNN Text Classification Research." Proceedings - 14th International Symposium on Distributed Computing and Applications for Business, Engineering and Science, DCABES 2015 (3): 485-88.
[2] Baharudin, Baharum, Lam Hong Lee, and Khairullah Khan. 2010. "A Review of Machine Learning Algorithms for Text-Documents Classification." Journal of Advances in Information Technology 1(1).
[3] Bhavani, A., and Nageswara Rao Moparthi. 2020. "Speech Recognition Using the NN." 11(12): 266-371.
[4] Cai, Jingjing, Jianping Li, Wei Li, and Ji Wang. 2019. "Deep Learning Model Used in Text Classification." 2018 15th International Computer Conference on Wavelet Active Media Technology and Information Processing, ICCWAMTIP 2018: 123-26.
[5] Dhar, Ankita, Niladri Sekhar Dash, and Kaushik Roy. 2018. "Classification of Text Documents through Distance Measurement: An Experiment with Multi-Domain Bangla Text Documents." Proceedings - 2017 3rd International Conference on Advances in Computing, Communication and Automation (Fall), ICACCA 2017, 2018-January: 1-6.
[6] Ikonomakis, M., Sotos Kotsiantis, and V. Tampakas. 2005. "Text Classification Using Machine Learning Techniques." WSEAS Transactions on Computers 4(8): 966-74.
[7] Jing, Ouyang. 2020. "Research on English Text Information Filtering Algorithm Based on SVM." Proceedings of 2020 IEEE International Conference on Power, Intelligent Computing and Systems, ICPICS 2020: 1001-4.
[8] Kolluri, Johnson, and Shaik Razia. 2020. "Text Classification Using Naïve Bayes Classifier." Materials Today: Proceedings. https://doi.org/10.1016/j.matpr.2020.10.058.
[9] Li, Zhonghao. 2019. "A Classification Retrieval Approach for English Legal Texts." Proceedings - 2019 International Conference on Intelligent Transportation, Big Data and Smart City, ICITBS 2019: 220-23.
[10] Liu, Cai Zhi, Yan Xiu Sheng, Zhi Qiang Wei, and Yong Quan Yang. 2018. "Research of Text Classification Based on Improved TF-IDF Algorithm." 2018 IEEE International Conference of Intelligent Robotic and Control Engineering, IRCE 2018 (2): 69-73.
[11] Liu, Kan, and Lu Chen. 2019. "Medical Social Media Text Classification Integrating Consumer Health Terminology." IEEE Access 7: 78185-93.
[12] Lu, Hongbiao, Xiaobao Liu, Yanchao Yin, and Zhicheng Chen. 2019. "A Patent Text Classification Model Based on Multivariate Neural Network Fusion." 2019 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019: 61-65.
[13] Ma, Houfeng, Xinghua Fan, and Ji Chen. 2008. "An Incremental Chinese Text Classification Algorithm Based on Quick Clustering." Proceedings - International Symposium on Information Processing, ISIP 2008 and International Pacific Workshop on Web Mining and Web-Based Application, WMWA 2008: 308-12.
[14] Ma, Yajing, Yonghong Li, Xiaolong Wu, and Xiang Zhang. 2018. "Chinese Text Classification Review." Proceedings - 9th International Conference on Information Technology in Medicine and Education, ITME 2018: 737-39.
[15] Malakar, Susanta, and Werapon Chiracharit. 2020. "Thai Text Detection and Classification Using Convolutional Neural Network." 2020 59th Annual Conference of the Society of Instrument and Control Engineers of Japan, SICE 2020 (September): 99-102.
[16] Mirończuk, Marcin Michał, and Jarosław Protasiewicz. 2018. "A Recent Overview of the State-of-the-Art Elements of Text Classification." Expert Systems with Applications 106: 36-54.
[17] Myaeng, Sung Hyon, Kyoung Soo Han, and Hae Chang Rim. 2006. "Some Effective Techniques for Naive Bayes Text Classification." IEEE Transactions on Knowledge and Data Engineering 18(11): 1457-66.
[18] Utomo, Muhammad Romi Ario, and Yuliant Sibaroni. 2019. "Text Classification of British English and American English Using Support Vector Machine." 2019 7th International Conference on Information and Communication Technology, ICoICT 2019: 1-6.
[19] Venkatesh, and K. V. Ranjitha. 2019. "Classification and Optimization Scheme for Text Data Using Machine Learning Naïve Bayes Classifier." 2018 IEEE World Symposium on Communication Engineering, WSCE 2018: 33-36.
[20] Zheng, Y. 2019. "An Exploration on Text Classification with Classical Machine Learning Algorithm." Proceedings - 2019 International Conference on Machine Learning, Big Data and Business Intelligence, MLBDBI 2019: 81-85. doi: 10.1109/MLBDBI48998.2019.00023.