A Review of the State of the Art of Text Classification Algorithms
A.Bhavani
Assistant Professor/Computer Science & Engineering
GMR Institute of Technology
Rajam, Andhra Pradesh, India
bhavaniashapu@gmail.com
Dr. B. Santhosh Kumar, SMIEEE
Professor/ Computer Science & Engineering,
Guru Nanak Institute of Technology,
Hyderabad, Telangana, India
drbsanthoshkumar@ieee.org
Abstract— In recent years, the categorization of text documents into predefined classes has received growing interest due to the increasing volume of documents in digital form and the need to organize them. Text categorization is one of the most widely used natural language processing (NLP) applications and is typically achieved with machine learning algorithms. Finding the most suitable representation and technique for text classification remains a challenge for researchers. The classification process can be performed manually or automatically. This paper reviews preprocessing, feature extraction, and the different algorithms and techniques used for text classification, and finally discusses the performance metrics used for assessment.
Keywords— Text Classification; Preprocessing; Feature Extraction; Classification Algorithms; Evaluation Measures; Natural Language Processing; Deep Learning Models
I. INTRODUCTION
Text classification is the technique of categorizing text into organized classes according to its content. It is also called text tagging or text categorization. It is one of the significant tasks in natural language processing (NLP), with extensive applications including intent detection, topic labeling, spam detection, and sentiment analysis. Using NLP, text classifiers can automatically analyze text and then assign a pre-defined collection of tags or categories according to its topic, whether the source is medical studies, files, documents, or content from across the web. Before the classifier can assign a category to a piece of text, the similarity of all inputs in the training set must be determined; consequently, classification performance assessment tends to degrade as the number of training inputs rises [1]. The automated arrangement of documents into classes has become a central task for knowledge discovery and information organization as the availability of electronic documents and the growth of the internet have increased [2].
Natural language processing (NLP), machine learning techniques, and data mining are used to identify and discover patterns in electronic documents automatically. The main purpose of text mining is to enable users to extract information from textual resources and to handle the related operations. The aim of Information Extraction (IE) methods is to extract precise data from text documents; this is the first approach, which suggests that text mining is essentially the same as data extraction. The process of finding documents that provide answers to questions is known as information retrieval (IR). To achieve this, statistical measurements and methods for automated text data processing with respect to a given query are used. In its broadest sense, information retrieval encompasses the entire spectrum of information processing, from data retrieval to knowledge retrieval. Text classification follows a sequence of processing steps, namely pre-processing, feature extraction, and classification algorithm models for training and prediction, as shown in Fig. 1. The training data set is used to train the model, and prediction is used to find the labelled output after the text document has been classified with different machine learning algorithms acting as classifier models.
Fig.1. Text classification Process
A. Pre-processing
Pre-processing is the important task for classification of text in
natural language processing. There are three main components
for processing. Those are Tokenization, Normalization and
Noise removal.
Tokenization: Tokenization involves strings that divided into
smaller tokens. It is possible to tokenize paragraphs into
sentences and tokenize sentences into words.
Normalization: Normalization aims to eliminate common and highly frequent words such as "the" and "a" and to remove other irrelevant words.
Noise removal: Noise removal removes unwanted text and converts different forms of words into a common canonical form [3].
Fig.2. Text classification Preprocessing
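To make these three steps concrete, the following minimal Python sketch applies noise removal, tokenization, and normalization to a short string. The sample sentence and the small stop-word list are assumptions made only for this example; they are not taken from the reviewed works.

```python
# Illustrative pre-processing sketch (standard library only): tokenization,
# normalization, and noise removal. The sample sentence and the small
# stop-word list are assumptions made for the example.
import re

STOP_WORDS = {"the", "a", "an", "and", "is", "are", "of", "to", "in"}

def preprocess(text):
    # Noise removal: strip punctuation/digits and lower-case the text.
    text = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    # Tokenization: split the cleaned string into word tokens.
    tokens = text.split()
    # Normalization: drop common and frequent words such as "the" and "a".
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The classifiers are classifying 100 text documents!"))
# -> ['classifiers', 'classifying', 'text', 'documents']
```

In practice a stemmer or lemmatizer would also be applied at the normalization step so that different forms of a word map to one canonical form.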
B. Feature extraction
The important task of text classification is feature extraction.
It gives not only more accuracy and gives greater probability,
saves computational time. Feature extractions have different
characteristics are word bag and embedding and extract the
different characteristics from text [4]. The features converted
into different vectors as shown in fig. 1. The vectors are
counted in the form of TF-IDF Vectors.TF-IDF is a Term
Frequency and Inverse Document Frequency for two-word
vectors [5]. For TF, the frequency of occurrence in a text of a
word t, d is determined and assigned to the same word as
weight. The value of word in the text document calculated by
the IDF. By using the following equation, the IDF of word t
was determined
IDFt= (1)
The number of documents shows where N is. Now, using the
following equation, the TF-IDF formula for word t are
estimated in document d
TF-IDF = TF *IDFt (2)
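As a concrete illustration of Equations (1) and (2), the following short Python sketch computes TF-IDF weights over a small toy corpus. The three example documents and the use of the natural logarithm are assumptions made purely for illustration.

```python
# Minimal TF-IDF sketch following Equations (1) and (2).
# The toy corpus below is only an illustrative assumption.
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)                                   # number of documents

# df[t]: number of documents containing term t
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)    # term frequency TF_{t,d}
    idf = math.log(N / df[term])                     # Equation (1)
    return tf * idf                                  # Equation (2)

print(round(tf_idf("cat", tokenized[0]), 4))   # rare word: relatively high weight
print(round(tf_idf("the", tokenized[0]), 4))   # common word: low IDF, low weight
```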
After preprocessing and feature extraction, the next step is to classify the text documents using machine learning classification techniques. Based on these algorithms, documents are assigned single-class or multi-class labels. Finally, performance evaluation measures are used to compute metrics such as accuracy and recall. Based on these measures, one can identify which classification algorithm performs well and takes less computation time when classifying documents.
II. LITERATURE SURVEY
Compared with the traditional text classification algorithm, the KNN algorithm reduces the time complexity and significantly increases the efficacy of text classification, although its accuracy is lower. Compared with other machine learning algorithms, Naive Bayes performs well on textual and numeric data, but its conditional independence assumption is violated by real-life data, it ignores the frequency of word occurrences, and it performs poorly when features are strongly correlated. Hence, this algorithm is mainly used for email classification and spam filtering.
Among machine learning algorithms for text classification, SVM classification is the most effective compared with the other supervised machine learning algorithms. SVM better captures the inherent characteristics of the data. Moreover, its ability to learn can be independent of the dimensionality of the feature space, and it finds a global rather than a local minimum. However, SVM has some difficulties when it comes to kernel selection and parameter tuning. Compared with other algorithms, KNN and Naïve Bayes are well suited to pre-processing. KNN continues to achieve good results as the number of documents increases, which is not the case for SVM.
In KNN it is difficult to find the optimal value of K, and classification takes more time. In addition, more work is required to improve the performance and document classification accuracy. With the growing number of electronic documents, new algorithms and solutions are needed to extract useful knowledge. In [6], the rough set method is combined with the KNN text classification algorithm to form an improved KNN text classification algorithm, in which similarity is computed using distance measures to improve classification performance [6].
Another paper presents the importance of different text classification techniques and the challenges that these algorithms address. It reviews different feature selection techniques for achieving better accuracy, and shows how semantic and ontology-based methods are used for information retrieval and document classification. The paper also explains that the performance of a classifier depends on the training text, and that high-quality training data tends to produce classifiers with good evaluation performance [7].
Jing Ouyang reports that SVM-based text classification improves when the data and device parameters are tuned in specific experiments, so the classification algorithm significantly improves the efficiency and accuracy of the classifier for text classification [8].
The authors explain that different applications can be categorized into topic classification, which classifies text based on topic, relevance, or aspect; sentiment analysis, which detects the sentiment of text and returns a positive or negative result; and intent classification, which classifies text based on intent, for example feedback, request, or complaint [9].
The following parameters are used for measuring the performance of classifiers: true positives (TP) as correctly identified, false positives (FP) as incorrectly identified, true negatives (TN) as correctly rejected, and false negatives (FN) as incorrectly rejected [3]. Precision, recall, F-measure, and micro and macro averages are used as the evaluation indices for text classification [10].
These parameters are applied to a medical data set in [11], where the algorithm uses different modules to achieve higher classification accuracy. The feature learning module contains a word embedding layer, positional encoding, an attention mechanism, and a gradient reversal layer; a word extraction module is based on an adversarial network; and text classification is performed using dual-channel subtraction [11].
In [13], two distance measure techniques are used to obtain better accuracy for Bangla text. Different classification algorithms are applied to input text divided into domains such as business, state, science, sports, and medicine. The classification techniques compared are Euclidean distance, cosine similarity, J48, Naïve Bayes, multinomial Naïve Bayes, random forest, and simple logistic regression. Euclidean distance and cosine similarity generate the best accuracies, 95.20% and 95.80% respectively [13]. Another paper uses TF, IDF, and CHI statistics for the classification of legal provisions and characteristic words. The counts of documents that do and do not belong to a class are denoted A, B, C, and D for computing the measures. English legal texts from domains such as society, military, sports, finance, IT, property, education, and entertainment are used, and the improved TF-IDF algorithm gives better performance than the traditional TF-IDF algorithm [14].
For patent text classification, multivariate neural network fusion is used [15]. The model parameters and their values are: word embedding dimension 300; convolution kernel sizes {3, 4, 5}; number of BiGRU nodes 128; dropout 0.5; batch size 64; epoch size 100; optimizer rmsprop; loss categorical cross-entropy [15].
Fig.3. Comparison of accuracy for ML techniques (Naive Bayes, KNN, Random Forest, SVM, CNN, RNN) on Chinese text, short text, English text, and news text; accuracy shown on a 0-100% scale.
Clustering, an unsupervised learning approach, is used in [12] for text classification and for removing noisy data; compared with Bayes-based incremental learning algorithms, the clustering algorithm improves both the efficiency and the accuracy of incremental learning [12]. For English text, one author compares two SVM variants, a standard SVM and an improved SVM. The improved SVM has the following characteristics: good fault tolerance, information filtering based on user requirements, stable running, very few system bugs, good system speed, and less processing time.
In text classification, pre-processing is the first and primary phase. Pre-processing of the dataset consists of several steps: plain text is taken as input data, the data set is partitioned into different bags of words, and the data are vectorized as different n-grams and divided into a training
set and a test set, after which the training dataset is built for multiple algorithms. Finally, the test set is predicted based on the result of the training set and the accuracy is calculated for all algorithms. In this paper, TextCNN, TextRNN, and TextRNN + Attention methods are used for text classification. Text comes in different forms with a specific sequence, and this context maintains the syntactic and semantic dependencies. The implemented algorithms are the deep learning algorithms convolutional neural network (CNN) and recurrent neural network (RNN).
The RNN technique can handle semantic dependencies to achieve meaningful classification, but it cannot be parallelized well, so it is more suitable for short text; for long text it takes more training time when processing the text. The CNN can be parallelized well for classification and is suitable for long text and for filters of different sizes. It uses multiple channels, and the most influential features can be selected using max pooling. For feature selection, it applies dropout at the fully connected layer to obtain the classification result. These algorithms achieve accuracies of 85% for CNN and 82% for RNN. Text classification can also be performed with convolutional neural networks based on text detection and image detection [16].
Text classification has become more advanced and requires six key components: data collection, data processing and labelling, extraction and weighting of features, selection or projection of features, model training, and evaluation of solutions. For fast text data, the algorithms SVM, KNN, and Naïve Bayes generated accuracies of 91%, 88%, and 86%, respectively [17].
In feature extraction, the word tags are divided into unigrams and bigrams to make it easier to determine the accuracy and to speed up the classification of text documents using TF-IDF. On British English and American English text, the SVM algorithm achieves 92.1% accuracy in the stemming test compared with normal text. Unigram text gives an accuracy of 91.2% compared with bigrams and N-grams; using a linear kernel, an accuracy of 87.1% is obtained compared with other kernels; and, taking threshold values from 1 to 10, the highest accuracy of 94.0% is obtained at a threshold value of 2 [18].
Hadoop MapReduce faces many limitations in the field of text classification with K-Means. The work in [19] proposes Hadoop MapReduce with a Naïve Bayes classifier to reduce these problems. This classifier is much better in terms of time and space complexity, and it is simple and easy to use. On the medical data set, accuracies of 47.85% for Hadoop MapReduce, 72.13% for Gaussian Naïve Bayes, 65.5% for Bernoulli Naïve Bayes, and 52.45% for multinomial Naïve Bayes were generated [19].
In [20], it is observed that the traditional TF-IDF algorithm cannot capture the distribution of feature words across different categories. The author therefore proposes an improved TF-IDF algorithm implemented with a weighting factor that reflects the degree of inter-class and intra-class distribution and the association between classes and feature words. The paper uses the Word2vec model to represent text words as word vectors instead of a vector space model, combining the word vector with the word weight. Micro and macro averages are used for performance evaluation; the micro average gives higher accuracy than the macro average for both the traditional and the improved TF-IDF. The improved TF-IDF gives higher classification accuracy than the traditional TF-IDF and produces results more effectively [20].
The works discussed above cover different machine learning and deep learning algorithms. These algorithms are applied to Chinese text, short text, English text, and news text, and the accuracies obtained in those domains are shown in Fig. 3. Nowadays, the most up-to-date text classification algorithms are deep learning techniques and their sub-domains, which give higher accuracy and take less computation time. These algorithms are appropriate for large amounts of data with multi-class text classification.
III. TEXT CLASSIFICATION ALGORITHMS
Machine learning algorithms used to classify documents can be divided into unsupervised, supervised, and semi-supervised approaches. Many techniques and algorithms for clustering and classification of electronic documents have recently been proposed. Drawing on the current literature, this section focuses on supervised classification methods, emerging technologies, and some of the opportunities and challenges. As internet use has rapidly increased, the automated sorting of documents into predefined categories has attracted active attention. The role of automatic text classification has been extensively studied in recent years, and rapid progress appears to have been made in this field, including machine learning approaches such as K-Nearest Neighbors (KNN), Naïve Bayes, Support Vector Machine (SVM), Decision Tree, Neural Networks, Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN). Automatic text classification typically employs supervised learning techniques, in which pre-defined category labels are assigned to documents based on the probabilities indicated by a training collection of labelled documents. Some of these methods are described below.
Fig.4. Machine Learning Algorithms
A. Linear Regression
Linear regression is the most basic method in machine learning. It is used to estimate real values based on independent variables: a linear line is fitted to express the relationship between the dependent variable and the independent variables. This best-fit line is called the regression line and is represented by the equation y = mx + c, where y is the dependent variable, x is the independent variable, m is the slope, and c is the intercept constant. The coefficients m and c are calculated by minimizing the sum of squared differences between the regression line and the data points. Linear regression falls into two categories: simple linear regression, which uses one independent variable, and multiple linear regression, which uses more than one independent variable. Curvilinear or polynomial regression can be used when a straight regression line does not fit the data well.
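A minimal sketch of fitting the regression line is shown below; the toy data points are an illustrative assumption. NumPy's least-squares polynomial fit estimates m and c by minimizing the sum of squared differences described above.

```python
# Simple linear regression sketch: estimate m and c in y = m*x + c
# by least squares. The toy data points are an illustrative assumption.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # independent variable
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])        # dependent variable

# np.polyfit minimizes the sum of squared differences between the
# regression line and the data points (degree 1 -> straight line).
m, c = np.polyfit(x, y, 1)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
print("prediction for x = 6:", m * 6 + c)
```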
B. K-Nearest Neighbors (K-NN)
KNN is a machine learning algorithm for both regression and classification problems. The KNN algorithm is a simple method that stores all available cases and classifies new cases based on a majority vote of their k nearest neighbours. A case is assigned to the class that is most common among its K closest neighbours, as determined by a distance function. These distance functions include Minkowski, Euclidean, Manhattan, and Hamming distance; the first three are used for continuous variables and Hamming distance for categorical variables. KNN is a non-parametric classification technique for text classification. First, the text to be classified is identified so that similarity can be computed; next, KNN identifies the neighbours using the K value for text classification. Choosing the K value is important because it affects the accuracy of text classification [1]. Since KNN is computationally costly, variables should be normalised to avoid bias, and more time should be spent at the pre-processing level before using KNN for tasks such as outlier detection and noise reduction.
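A minimal sketch of KNN-based text classification with scikit-learn is given below. The tiny labelled corpus, the choice of K = 3, and the cosine distance metric are illustrative assumptions, not settings prescribed by the reviewed works.

```python
# KNN text classification sketch with scikit-learn.
# The tiny labelled corpus and K value are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_texts = [
    "cheap pills buy now", "limited offer click here",        # spam
    "meeting agenda for monday", "please review the report",  # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

# TF-IDF vectors feed a K-nearest-neighbours classifier: the K = 3 closest
# training documents (by cosine distance) vote on the class of a new document.
knn = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
knn.fit(train_texts, train_labels)
print(knn.predict(["click now for a cheap offer"]))   # likely ['spam']
```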
C. Naïve Bayes
Naïve Bayes is a classification method based on Bayes' theorem and the assumption of predictor independence. In simple terms, a Naïve Bayes classifier assumes that the presence of one feature in a class is unrelated to the presence of any other feature. The Naïve Bayes algorithm is used for text classification because of its effectiveness and simplicity. Compared with simple Naïve Bayes, multinomial Naïve Bayes has considerable advantages and gives higher accuracy. Naïve Bayes quantifies the characteristics using the conditional probability of two events happening, depending on the probability of each actual occurrence; therefore, for a given text, the probability of each word tag is calculated. Bayes' theorem provides the calculation of the posterior probability P(c|x) from P(x), P(c), and P(x|c). Bayes' rule is given as

P(c|x) = P(x|c) * P(c) / P(x)    (3)

and, under the independence assumption,

P(c|x) ∝ P(x1|c) * P(x2|c) * ... * P(xn|c) * P(c)

where
P(c|x) is the posterior probability of the class (target) given the predictor (attribute),
P(c) is the prior probability of the class,
P(x|c) is the likelihood, i.e., the probability of the predictor given the class, and
P(x) is the prior probability of the predictor.
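The following sketch shows multinomial Naïve Bayes text classification with scikit-learn, where word counts serve as the predictors and the classifier applies Bayes' rule under the independence assumption of Equation (3). The small training corpus is assumed only for illustration.

```python
# Multinomial Naïve Bayes sketch for text classification (scikit-learn).
# The small training corpus is an illustrative assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize today", "free money click now",          # spam
    "project meeting at noon", "lunch with the team tomorrow", # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Word counts are the features; MultinomialNB applies Bayes' rule with the
# naive independence assumption of Equation (3) over the word counts.
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(texts, labels)

print(nb.predict(["free prize meeting"]))          # predicted class
print(nb.predict_proba(["free prize meeting"]))    # posterior P(c|x) per class
```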
D. Support Vector Machine
SVM is one of the best supervised machine learning algorithms, used for both regression and classification but mainly applied to classification problems. Every data item is placed in a high-dimensional space, and hyperplanes are then identified based on the features; these hyperplanes are used to separate the classes. SVM gives good precision but lower recall for text classification.
A customized SVM improves recall by using different threshold values and gives the best results for both precision and recall. Used in different ways, SVM-based classification improves efficiency and is faster at runtime. For high-dimensional data, SVM gives higher accuracy because it controls model complexity and overfitting issues. SVM is one of the most important machine learning algorithms for text classification and relies on a linear transformation. It requires more resources for classification compared with Naïve Bayes, but it gives higher accuracy and faster predictions. SVM uses a hyperplane to separate the tags (vectors) built from the input features.
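A minimal sketch of SVM text classification with scikit-learn follows; the toy corpus and the regularization value C = 1.0 are illustrative assumptions.

```python
# Linear SVM sketch for text classification (scikit-learn).
# The toy corpus and C value are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = [
    "the match ended in a draw", "the striker scored a late goal",    # sports
    "the court upheld the contract", "the judge dismissed the case",  # legal
]
labels = ["sports", "sports", "legal", "legal"]

# TF-IDF vectors place each document in a high-dimensional space;
# LinearSVC finds the separating hyperplane between the classes.
svm = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
svm.fit(texts, labels)
print(svm.predict(["the referee awarded a goal"]))   # likely ['sports']
```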
E. Decision Tree
The decision tree method is a type of supervised learning algorithm used for classification problems. This technique can be used for both continuous and categorical dependent variables. The population is split into two or more homogeneous groups based on the most significant independent variables/attributes so as to create groups that are as distinct as possible. The best feature for every node of the tree can be selected using an attribute selection measure; two common techniques are information gain and the Gini index.
The decision tree constructs document categorizations based on true or false outcomes: the leaves of the tree represent the document type, and the branches reflect the combinations of features that lead to those categories [2].
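A short illustrative sketch of a decision tree text classifier with scikit-learn is shown below; the toy corpus, the maximum depth, and the choice of the Gini index are assumptions made only for the example.

```python
# Decision tree sketch for text classification (scikit-learn).
# The toy corpus and tree depth are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "invoice attached please pay", "your payment is overdue",   # finance
    "new phone released this week", "laptop review and specs",  # tech
]
labels = ["finance", "finance", "tech", "tech"]

# criterion="gini" uses the Gini index as the attribute selection measure;
# criterion="entropy" would use information gain instead.
tree = make_pipeline(
    CountVectorizer(),
    DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0),
)
tree.fit(texts, labels)
print(tree.predict(["please pay the invoice"]))   # likely ['finance']
```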
F. Neural Networks
For text classification, neural networks have been able to achieve significant performance in sentence and document modelling. Two commonly used types of neural networks are the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN).
A CNN applies filters over windows of different sizes to extract features of the documents through its layers. These windows are used to classify the features generated from the word tag vectors. Features are extracted from the input-layer data using a kernel or filter, i.e., a convolution, which is computed as a dot product of vectors over the input data; the dot product is zero when the vectors are unrelated (orthogonal). The embedded word vectors are used for finding the similarity between vectors. The number of filters in the convolution depends on the user's requirement. After convolution, the filters generate multiple feature maps; these steps constitute the feature extraction of the input data across the layers. After the features are generated, an activation function is applied to the filter outputs, producing an output with a nonlinear relationship to the input. The convolution outputs are then dimensionally reduced to decrease the complexity and computational time. A CNN can be improved by adding more dense layers and a smaller number of wide layers.
An RNN is a sequence model designed to deal with variable-length document sequences and long-term dependencies. Both of these networks can use the long short-term memory (LSTM) network, which extracts higher-level sequences of tag features.
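A minimal CNN text classifier sketch in Keras is shown below. The vocabulary size, sequence length, layer sizes, and the randomly generated stand-in data are illustrative assumptions rather than settings taken from the reviewed papers.

```python
# Minimal CNN text classifier sketch with Keras (TensorFlow).
# Vocabulary size, sequence length and the dummy data are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, num_classes = 5000, 100, 2

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 128),         # word embeddings
    layers.Conv1D(64, 5, activation="relu"),   # convolution filters over word windows
    layers.GlobalMaxPooling1D(),               # max pooling keeps the strongest feature per filter
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                       # dropout before the final fully connected layer
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy integer-encoded documents stand in for real tokenized text.
x = np.random.randint(1, vocab_size, size=(32, seq_len))
y = np.random.randint(0, num_classes, size=(32,))
model.fit(x, y, epochs=1, batch_size=8, verbose=0)
print(model.predict(x[:1]).shape)   # (1, num_classes) class probabilities
```

Replacing the convolution and pooling layers with an LSTM layer would give the corresponding RNN variant described above.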
IV. PERFORMANCE EVALUATION MEASURES
The text classification evaluation measures are determined in terms of accuracy, recall, and F-measure; relying on a single measure such as recall or accuracy alone to test a classifier is not sufficient. The evaluation measures depend on the classification classes. Whether a document belongs to a class is described by four parameters: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Using these parameters, the evaluation measures are calculated as shown below:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Recall = TP / (TP + FN)    (5)

F-measure = 2 * Precision * Recall / (Precision + Recall), where Precision = TP / (TP + FP)    (6)

Multi-class text classification uses the micro average and the macro average as evaluation measures. The macro average is the mean of the F-measure values of all classes, while the micro average is computed from the counts aggregated over all classes, for example

Micro-averaged precision = Σi TPi / Σi (TPi + FPi)    (7)
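The following sketch, using scikit-learn and assumed example label vectors, computes the measures of Equations (4) to (7), including the micro and macro averages.

```python
# Evaluation-measure sketch following Equations (4)-(7) (scikit-learn).
# The example label vectors are illustrative assumptions.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

print("accuracy :", accuracy_score(y_true, y_pred))                      # Eq. (4)
print("recall   :", recall_score(y_true, y_pred, pos_label="spam"))      # Eq. (5)
print("precision:", precision_score(y_true, y_pred, pos_label="spam"))
print("F-measure:", f1_score(y_true, y_pred, pos_label="spam"))          # Eq. (6)

# Micro vs. macro averaging for multi-class problems:
# micro aggregates TP/FP/FN over all classes (Eq. 7),
# macro averages the per-class F-measure values.
print("micro F1 :", f1_score(y_true, y_pred, average="micro"))
print("macro F1 :", f1_score(y_true, y_pred, average="macro"))
```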
The efficiency of the classifier as measured by the above
parameters is as follows:
Table 1. Classification evaluation matrix

                                           Belongs to class             Does not belong to class
Classification belongs to group            TP / Correctly identified    FP / Incorrectly identified
Classification does not belong to group    FN / Incorrectly rejected    TN / Correctly rejected
V. CONCLUSION
Text plays a major role in the present era because much information is unstructured and comes in different types of data. Many issues accompany text data: it is difficult and time-consuming to understand, interpret, sort, and organize. Text classification algorithms are used to classify text quickly and efficiently in applications such as legal documents, emails, social media, surveys, chatbots, and more. This paper has reviewed text classification using different machine learning algorithms as classifier techniques. Comparing the different machine learning algorithms, SVM gives higher accuracy and takes less computation time. The decision tree is suited to hierarchical classification, while KNN gives higher accuracy when the K value is known and is more efficient when the distance measure is known. Finally, the deep learning algorithms CNN and RNN give higher accuracy than the other algorithms for large and multi-class text document collections; for multi-class classification, CNN gives the best results and is the most efficient. Evaluation measures such as accuracy and recall are used to identify the best classifier model with high accuracy and low time consumption.
References
[1] Aizhang, Guo, and Yang Tao. 2016. "Based on Rough Sets and the Associated Analysis of KNN Text Classification Research." Proceedings - 14th International Symposium on Distributed Computing and Applications for Business, Engineering and Science, DCABES 2015 (3): 485-88.
[2] Baharudin, Baharum, Lam Hong Lee, and Khairullah Khan. 2010. "A Review of Machine Learning Algorithms for Text-Documents Classification." Journal of Advances in Information Technology 1(1).
[3] Bhavani, A., and Nageswara Rao Moparthi. 2020. "Speech Recognition Using the NN." 11(12): 266-371.
[4] Cai, Jingjing, Jianping Li, Wei Li, and Ji Wang. 2019. "Deep Learning Model Used in Text Classification." 2018 15th International Computer Conference on Wavelet Active Media Technology and Information Processing, ICCWAMTIP 2018: 123-26.
[5] Dhar, Ankita, Niladri Sekhar Dash, and Kaushik Roy. 2018. "Classification of Text Documents through Distance Measurement: An Experiment with Multi-Domain Bangla Text Documents." Proceedings - 2017 3rd International Conference on Advances in Computing, Communication and Automation (Fall), ICACCA 2017, 2018-January: 1-6.
[6] Ikonomakis, M., Sotos Kotsiantis, and V. Tampakas. 2005. "Text Classification Using Machine Learning Techniques." WSEAS Transactions on Computers 4(8): 966-74.
[7] Jing, Ouyang. 2020. "Research on English Text Information Filtering Algorithm Based on SVM." Proceedings of 2020 IEEE International Conference on Power, Intelligent Computing and Systems, ICPICS 2020: 1001-4.
[8] Kolluri, Johnson, and Shaik Razia. 2020. "Text Classification Using Naïve Bayes Classifier." Materials Today: Proceedings. https://doi.org/10.1016/j.matpr.2020.10.058.
[9] Li, Zhonghao. 2019. "A Classification Retrieval Approach for English Legal Texts." Proceedings - 2019 International Conference on Intelligent Transportation, Big Data and Smart City, ICITBS 2019: 220-23.
[10] Liu, Cai Zhi, Yan Xiu Sheng, Zhi Qiang Wei, and Yong Quan Yang. 2018. "Research of Text Classification Based on Improved TF-IDF Algorithm." 2018 IEEE International Conference of Intelligent Robotic and Control Engineering, IRCE 2018 (2): 69-73.
[11] Liu, Kan, and Lu Chen. 2019. "Medical Social Media Text Classification Integrating Consumer Health Terminology." IEEE Access 7: 78185-93.
[12] Lu, Hongbiao, Xiaobao Liu, Yanchao Yin, and Zhicheng Chen. 2019. "A Patent Text Classification Model Based on Multivariate Neural Network Fusion." 2019 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019: 61-65.
[13] Ma, Houfeng, Xinghua Fan, and Ji Chen. 2008. "An Incremental Chinese Text Classification Algorithm Based on Quick Clustering." Proceedings - International Symposium on Information Processing, ISIP 2008 and International Pacific Workshop on Web Mining and Web-Based Application, WMWA 2008: 308-12.
[14] Ma, Yajing, Yonghong Li, Xiaolong Wu, and Xiang Zhang. 2018. "Chinese Text Classification Review." Proceedings - 9th International Conference on Information Technology in Medicine and Education, ITME 2018: 737-39.
[15] Malakar, Susanta, and Werapon Chiracharit. 2020. "Thai Text Detection and Classification Using Convolutional Neural Network." 2020 59th Annual Conference of the Society of Instrument and Control Engineers of Japan, SICE 2020 (September): 99-102.
[16] Mirończuk, Marcin Michał, and Jarosław Protasiewicz. 2018. "A Recent Overview of the State-of-the-Art Elements of Text Classification." Expert Systems with Applications 106: 36-54.
[17] Myaeng, Sung Hyon, Kyoung Soo Han, and Hae Chang Rim. 2006. "Some Effective Techniques for Naive Bayes Text Classification." IEEE Transactions on Knowledge and Data Engineering 18(11): 1457-66.
[18] Utomo, Muhammad Romi Ario, and Yuliant Sibaroni. 2019. "Text Classification of British English and American English Using Support Vector Machine." 2019 7th International Conference on Information and Communication Technology, ICoICT 2019: 1-6.
[19] Venkatesh, and K. V. Ranjitha. 2019. "Classification and Optimization Scheme for Text Data Using Machine Learning Naïve Bayes Classifier." 2018 IEEE World Symposium on Communication Engineering, WSCE 2018: 33-36.
[20] Zheng, Y. 2019. "An Exploration on Text Classification with Classical Machine Learning Algorithm." Proceedings - 2019 International Conference on Machine Learning, Big Data and Business Intelligence, MLBDBI 2019: 81-85. doi: 10.1109/MLBDBI48998.2019.00023.