J. Basic. Appl. Sci. Res., 3(1s)31-35, 2013
© 2013, TextRoad Publication
ISSN 2090-4304
Journal of Basic and Applied
Scientific Research
www.textroad.com
Corresponding author: Nasim VasfiSisi, Department of Computer, Shabestar Branch, Islamic Azad University, Shabestar,
Iran. Email address: Nasim_vasfi@yahoo.com
Text Classification with Machine Learning Algorithms
Nasim VasfiSisi1 and Mohammad Reza Feizi Derakhshi2
1Department of Computer, Shabestar Branch, Islamic Azad University, Shabestar, Iran
2Department of Computer, University of Tabriz, Tabriz, Iran
Received: June 10 2013
Accepted: July 7 2013
ABSTRACT
With growing access to electronic documents and the rapid growth of the World Wide Web, automatic document
classification has become a key method for organizing information and for knowledge discovery. Proper
classification of electronic documents, online news, weblogs, emails and digital libraries requires text mining,
machine learning and natural language processing techniques in order to obtain meaningful knowledge. The aim
of this paper is to highlight the major techniques and methods applied in document classification by reviewing
some existing text classification methods.
KEYWORDS: Text mining, text classification, machine learning algorithms, classifiers.
1. INTRODUCTION
In recent years, the volume of text documents on the internet, in news sources and on company intranets has
grown dramatically, and these documents need to be classified. The task of automatic text classification is to
assign text documents to predefined classes, which helps both to organize these large resources and to find
information in them. Applications include automatic indexing of scientific articles against a predefined
vocabulary of terms, filing patents into patent directories, spam filtering, identification of document types,
automatic grading of articles, authorship attribution, and the organization of e-government repositories, news
articles, biological databases, chat rooms, online forums, electronic mail and weblog pools [2].
Automatic classification of documents frees organizations from manual classification, which can be
expensive and time consuming. The accuracy of modern text classification systems rivals that of trained
professionals; these systems result from a combination of information retrieval technologies and machine
learning technologies [2,3].
Today, text classification remains challenging because datasets contain a very large number of features, a
very large number of training samples and interdependent features, which has led to the development of many
different text classification algorithms [10].
In text classification, each document is placed in no class, in exactly one class or in several classes. The
main goal of using machine learning methods is for the classifier to learn automatically from samples that have
already been assigned to classes [1]. For example, each incoming news item can be labeled automatically with a
subject such as “sport”, “politics” or “art”. Classification starts from a dataset d=(d1,…, dn) whose documents are
labeled with classes c1, c2,…, cn (such as sport, politics, etc.). A classification model is then learned that can
determine the suitable class or classes for a new document d from the same domain. A single-label document
belongs to only one class, whereas a multi-label document belongs to more than one class [4].
The rest of this paper is organized as follows: section two describes the document pre-processing steps,
section three presents different text classification methods and section four concludes.
2. Document pre-processing
The first step in text classification is to transform documents, which are typically strings of characters in
various formats, into a representation suitable for learning and classification algorithms. As in information
retrieval, it is generally preferable to reduce each word to its root so that the root can be treated as a single unit
across documents, and this unit serves as a feature of the text. Each distinct word corresponds to one feature,
whose value equals the number of times the word occurs in the document. To avoid unnecessarily large feature
vectors, a word is used as a feature only if it occurs at least three times in the training data and is not a stop
word [1]. Fig. 1 shows the text classification process.
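Fig 1. The text classification process [1]: Read Document → Tokenize Text → Stemming → Delete Stopwords →
Vector Representation of Text → Feature Selection and/or Feature Transformation → Learning algorithm.
The steps in Fig. 1 are briefly described as follows: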
a) Read Document: first, all documents are read.
b) Tokenize Text: the text is broken into tokens, i.e. meaningful words, terms, phrases, symbols or
other elements. This process is called tokenization.
c) Stemming: each word is reduced to its root form.
d) Delete Stopwords: words such as in, this, a, an, the and with are removed.
e) Vector Representation of Text: an algebraic model is defined to represent text documents as
vectors. Because these vectors are typically high-dimensional, the main goal of the feature
selection methods in the next step is to reduce dimensionality.
f) Feature Selection and/or Feature Transformation: the dimensionality of the dataset is reduced
using feature selection methods that remove features unrelated to classification. After feature
selection, and depending on the flexibility required, machine learning algorithms can be applied,
such as genetic algorithms, neural networks, rule induction, fuzzy decision trees, SVM, the K-NN
(K-Nearest Neighbor) algorithm, LSA, the Rocchio algorithm and Naïve Bayes [1].
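The pre-processing steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure used in [1]: the stop-word list is a tiny hypothetical one, the `stem` function is a crude suffix stripper standing in for a real stemmer, and all function names are chosen here for illustration. The frequency threshold of three follows the rule described earlier in this section.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real systems use a much longer one.
STOP_WORDS = {"in", "this", "a", "an", "the", "with", "and", "of", "to", "is"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Crude suffix stripping (an assumption, not a real stemming algorithm)."""
    if token in STOP_WORDS:
        # Leave stop words untouched so the later filter still recognizes them.
        return token
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, stem, then delete stop words (steps b, c and d)."""
    return [s for s in (stem(t) for t in tokenize(text)) if s not in STOP_WORDS]

def build_vocabulary(documents, min_count=3):
    """Keep only terms that occur at least min_count times in the training data."""
    totals = Counter()
    for doc in documents:
        totals.update(preprocess(doc))
    return sorted(term for term, n in totals.items() if n >= min_count)

def vectorize(text, vocabulary):
    """Represent a document as a vector of term counts over the vocabulary (step e)."""
    counts = Counter(preprocess(text))
    return [counts[term] for term in vocabulary]

# Toy training documents, invented for this example.
train_docs = [
    "The team wins the match",
    "A team is winning this match",
    "The match ended with the team celebrating",
    "Parliament passed the bill",
]
vocab = build_vocabulary(train_docs)
vector = vectorize("The team played a match with another team", vocab)
print(vocab, vector)
```

With these toy documents only "match" and "team" reach the threshold, so every document is reduced to a two-dimensional count vector. This shows how the frequency cut-off alone already acts as a simple form of dimensionality reduction.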
Machine learning, natural language processing (NLP) and data mining techniques work together to classify
electronic documents automatically and to discover patterns in them. The main goal of text mining is to allow
users to extract information from text resources and to perform operations such as retrieval, classification
(supervised, unsupervised and semi-supervised) and summarization [3].
Advances in computer hardware provide enough computing power for text classification to be used in
practical applications. Text classification is commonly used to handle spam emails, to sort large text
collections into topical classes, to support knowledge management and to assist internet search engines [6].
3. Classifiers
3.1 SVM algorithm
The standard SVM (Support Vector Machine) was proposed by Cortes and Vapnik in 1995 [8]. SVM is a
supervised learning method used for classification and regression. It comes from statistical learning theory and
is based on the Structural Risk Minimization principle, whose idea is to find a hypothesis that guarantees the
lowest error. SVM requires both a positive and a negative training set, which is uncommon among
classification methods. These two sets are needed for SVM to search for the decision surface, called the
hyperplane, that best separates the positive data from the negative data in n-dimensional space. SVM therefore
constructs a hyperplane or a set of hyperplanes in a high-dimensional space [2,3].
In general, good separation is achieved by the hyperplane that has the largest distance to the nearest
training data points of both classes (this distance is called the margin), and a larger margin generally yields a
lower classification error [8]. SVM also attempts to reduce the number of misclassified points, and this goal is
expressed as the optimization problem in equation (1) [2]:
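\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{l}\xi_i \tag{1}
\]
\[
\text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,l
\]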
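To make the trade-off in equation (1) concrete, the following minimal Python sketch evaluates the standard soft-margin objective, 0.5·‖w‖² plus C times the sum of the slack values, for a given hyperplane. It assumes the usual hinge form of the slack, ξ_i = max(0, 1 − y_i(w·x_i + b)). The function names, the toy samples and the chosen w and b are hypothetical, and the sketch only evaluates the objective; it does not solve the optimization problem.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def slack(w, b, x, y):
    """Slack of one sample: zero when it lies on or beyond its margin."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def soft_margin_objective(w, b, samples, C=1.0):
    """0.5 * ||w||^2 + C * (sum of slacks) for samples (x, y) with y in {-1, +1}."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in samples)

# Hypothetical toy data: two points outside the margin, one inside it.
samples = [([2.0, 0.0], 1), ([0.0, 2.0], -1), ([0.5, 0.0], 1)]
w, b = [1.0, -1.0], 0.0
value = soft_margin_objective(w, b, samples, C=1.0)
print(value)
```

In this sketch a larger C penalizes margin violations more heavily, while a smaller C favors a wider margin at the cost of more violations.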
3.2 Neural network algorithm
A neural network classifier is a network of units, where the input units usually represent words and the
output unit(s) represent the class or the class label. To classify a test document, its word weights are assigned
to the input units, the activation of these units is propagated forward through the network, and the resulting
value of the output unit(s) determines the class decision. Some researchers use a single-layer perceptron
because it is simple to implement, whereas the more complex multi-layer perceptron requires a more extensive
implementation. Using an effective feature selection method to reduce dimensionality improves the efficiency
of this approach. Recently proposed document classification methods based on neural networks are useful for
document management in companies [4].
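As an illustration of the single-layer case, the sketch below trains a perceptron whose input units receive term counts and whose single output unit decides between two classes. It is a minimal example under assumptions: the four-word vocabulary, the toy vectors, the function names and the use of the classic perceptron learning rule are choices made here, not details taken from [4].

```python
def predict(weights, bias, x):
    """Forward propagation: weighted sum of the inputs followed by a threshold."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation >= 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule on (feature_vector, label) pairs with labels 0 or 1."""
    n_features = len(samples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, label in samples:
            error = label - predict(weights, bias, x)
            # Weights change only when the current prediction is wrong.
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# Toy term-count vectors over the hypothetical vocabulary
# ["goal", "match", "vote", "election"]; label 1 = sport, 0 = politics.
training = [
    ([2, 1, 0, 0], 1),
    ([1, 2, 0, 0], 1),
    ([0, 0, 2, 1], 0),
    ([0, 0, 1, 2], 0),
]
weights, bias = train_perceptron(training)
print(predict(weights, bias, [1, 1, 0, 0]))
```

Each input weight ends up positive for words typical of the first class and negative for words typical of the second, which is why reducing the number of input words through feature selection directly reduces the size of the network.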
3.3 K-NN (K-Nearest Neighbor) algorithm
K-NN is an instance-based learning method and one of the simplest machine learning algorithms. An
example is classified by a majority vote of its neighbors: it is assigned to the most common class among its k
nearest neighbors. k is a positive integer and is typically small. If k=1, the example is simply assigned to the
class of its nearest neighbor. Choosing an odd k is useful in two-class problems because it prevents tied
votes [5]. K-NN is widely applicable because it is effective, non-parametric and simple to implement; however,
its classification time is long and the optimal value of k is difficult to find. The best choice of k depends on the
data: in general, a larger k reduces the effect of noise on the classification but makes the boundaries between
classes less distinct [4]. Fig. 2 shows an example of the K-NN classification algorithm [7]:
Fig 2. Example of K-NN classification algorithm [7].
Fig. 2 illustrates K-NN classification using multi-dimensional feature vectors, where triangles represent
the first class and squares represent the second class. The small circle is the test example. If k=3, the test
example is assigned to the triangle class, and if k=5, it is assigned to the square class [5].
The algorithm works as follows. The training examples are represented as vectors in a multi-dimensional
feature space, and the space is partitioned into regions by the training examples. A point in the space is
assigned to the class to which most of its K nearest training examples belong; Euclidean distance or cosine
similarity is usually used to find these neighbors. In the classification phase, a test example is represented as a
vector in the feature space, its Euclidean distance or cosine similarity to every training vector is measured, and
the K nearest training examples are selected. There are several ways to classify the test vector from these
neighbors; the classic K-NN algorithm assigns the class that receives the most votes among the K nearest
neighbors. The three main factors in the K-NN algorithm are as follows [7]:
1. The distance or similarity measure used to find the K nearest neighbors.
2. K, the number of nearest neighbors.
3. The decision rule used to derive the class of a test document from its K nearest neighbors.
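The three factors listed above map directly onto a short implementation. The following Python sketch is a minimal illustration, assuming plain majority voting as the decision rule. The function names and the toy coordinates are hypothetical; the coordinates are chosen only so that the result changes between k=3 and k=5, in the same way as in the Fig. 2 example.

```python
import math
from collections import Counter

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(test_vector, training, k, metric="euclidean"):
    """Classify test_vector by majority vote among its k nearest training examples."""
    if metric == "cosine":
        # Higher similarity means closer, so sort in descending order.
        ranked = sorted(training,
                        key=lambda ex: cosine_similarity(test_vector, ex[0]),
                        reverse=True)
    else:
        ranked = sorted(training,
                        key=lambda ex: euclidean_distance(test_vector, ex[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training set laid out around the test point (0, 0).
training = [
    ((1.0, 0.0), "triangle"),
    ((0.0, 1.1), "triangle"),
    ((-1.2, 0.0), "square"),
    ((0.0, -2.0), "square"),
    ((2.2, 0.0), "square"),
    ((3.0, 3.0), "triangle"),
]
test_point = (0.0, 0.0)
print(knn_classify(test_point, training, k=3))
print(knn_classify(test_point, training, k=5))
```

Because every training vector is compared with the test vector at classification time, the cost of a single prediction grows with the size of the training set, which is the long classification time noted above.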
3.4 Decision Tree
A decision tree is a classification algorithm whose structure is based on “if-then” classification rules. In
this method, the possible events are determined first and the tree is then drawn starting from the root node.
Each node holds a value obtained from the gain function [9].
In a decision tree, leaves represent classes of documents and branches represent the conjunctions of
features that lead to those classes. A well-structured decision tree can classify a document simply by placing it
at the root node of the tree and letting it run through the query structure until it reaches a leaf, which
represents the classification goal for the document. Fig. 3 shows an example of a decision tree classifier [3].
Fig 3. An example of a Decision Tree [3].
The decision tree classification method has notable advantages over other decision support tools. Its main
advantage is that it is easy to understand and interpret, even for non-expert users. Furthermore, the results can
be interpreted conveniently using simple mathematical algorithms. Experiments with decision trees have
shown that text classification involves a large number of relevant features. One application of decision trees is
the personalization of advertisements on web pages. A major risk in building a decision tree is overfitting the
training data: there may exist an alternative tree that categorizes the training data worse but would categorize
new documents better [3].
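The source does not specify which gain function is used at each node. A common choice, assumed here purely for illustration, is entropy-based information gain computed for the test "does the document contain this term?". The sketch below computes that gain for a few candidate terms. The documents, labels and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(documents, labels, term):
    """Entropy reduction obtained by splitting on whether a document contains term."""
    with_term = [lab for doc, lab in zip(documents, labels) if term in doc]
    without_term = [lab for doc, lab in zip(documents, labels) if term not in doc]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_term, without_term) if part)
    return entropy(labels) - remainder

# Toy documents represented as sets of terms, invented for this example.
docs = [
    {"goal", "match"},
    {"match", "team"},
    {"vote", "election"},
    {"vote", "team"},
]
labels = ["sport", "sport", "politics", "politics"]

candidates = ["match", "team", "vote"]
best_term = max(candidates, key=lambda t: information_gain(docs, labels, t))
print(best_term, information_gain(docs, labels, best_term))
```

Under this assumption, the term with the highest gain would become the "if" condition at the current node, for example "if the document contains match then sport", and the procedure would repeat on each branch. Stopping the splitting early or pruning branches is one common way to limit the overfitting risk mentioned above.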
3.5 Bayesian classification
Bayesian classification is a simple probabilistic classifier based on applying Bayes’ theorem with strong
independence assumptions. A more descriptive term for the underlying probability model would be an
independent feature model. The feature independence assumption makes the order of features irrelevant, so
the presence of one feature does not affect other features in classification. These assumptions make the
computation of the Bayesian classifier efficient, but they also limit its applicability significantly. Depending on
the precise nature of the probability model, the Bayesian classifier can be trained very efficiently and needs a
relatively small amount of training data to estimate the parameters necessary for classification. Because the
features are assumed to be independent, only the variances of the variables for each class need to be
determined, not the entire covariance matrix [3].
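The effect of the independence assumption can be seen in a small implementation: the score of a class is just its log prior plus one independent term per word. The sketch below is a minimal example under assumptions not stated in the source. It uses the multinomial variant of Naive Bayes, which is common for text, together with add-one (Laplace) smoothing. The toy messages and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Collect class frequencies and per-class term counts from tokenized documents."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary

def classify(tokens, model):
    """Return the class with the highest log posterior under the independence assumption."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior of the class
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (class, word) pairs from zeroing the score.
            likelihood = (term_counts[label][token] + 1) / (total_terms + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy tokenized messages, invented for this example.
train_docs = [
    ["cheap", "pills", "offer"],
    ["meeting", "agenda", "monday"],
    ["cheap", "offer", "now"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "ham", "spam", "ham"]
model = train_naive_bayes(train_docs, train_labels)
print(classify(["cheap", "meeting", "offer"], model))
```

Only one count per class and word has to be estimated, which is why the classifier can be trained from relatively little data. The same property means that correlations between words are ignored entirely.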
4. Conclusion
Various algorithms, as well as hybrid combinations of algorithms, have been proposed for the automatic
classification of documents. Bayesian classification works well for spam filtering, email filtering and text
classification and requires only a small amount of training data to estimate the parameters needed for
classification. It performs well on both text and numerical data and is easy to implement in comparison with
other algorithms.
However, the conditional independence assumption is violated by real-world data, so the classifier
performs poorly when features are strongly dependent on each other, and it does not take the frequency of
word occurrences into account. In short, the advantage of Bayesian classification is that it requires little
training data to estimate the necessary parameters, and its main disadvantage is its relatively low classification
performance in comparison with other discriminative algorithms.
SVM is recognized as one of the most effective text classification methods among supervised machine
learning algorithms. It provides high precision, but at the cost of lower recall. SVM captures the inherent
characteristics of the data and embodies the Structural Risk Minimization (SRM) principle, which minimizes
the upper bound on the generalization error; in addition, its ability to learn can be independent of the
dimensionality of the feature space. The K-NN algorithm performs well when documents are described by
highly local features, but its classification time is long and the optimal value of k is difficult to find. The major
advantage of the decision tree is that it is simple to understand.
REFERENCES
1. Bhavani Dasari, D. and Gopala Rao, K. V., Text Categorization and Machine Learning Methods,
Current State of the Art, Global Journal of Computer Science and Technology Software & Data
Engineering, 2012. 12(11).
2. Liu, X. and Fu, H., A Hybrid Algorithm for Text Classification Problem, Przegląd
Elektrotechniczny (Electrical Review), 2011.
3. Khan, A., Baharudin, B., Hong Lee, L. and Khan, Kh., A Review of Machine Learning
Algorithms for Text-Documents Classification, Journal of Advances in Information Technology,
2010. 1(1).
4. Korde, V. and Mahender, C. N., Text Classification and Classifiers, A Survey, International Journal
of Artificial Intelligence & Applications (IJAIA), 2012. 3(2).
5. Ananthi, S. and Sathyabama, S., Spam Filtering Using K-NN, Journal of Computer Applications,
2009. 2(3).
6. Mahinovs, A. and Tiwari, A., Text Classification Method Review, Decision Engineering Report
Series, 2007.
7. Miah, M., Improved k-NN Algorithm for Text Classification, In Proceedings of DMIN, 2009. p.
434-440.
8. Xiao.li, CH., Pei.yu, L., Zhen.fang, Z. and Ye, Q., A Method of Spam Filtering Based on
Weighted Support Vector Machines, IEEE International Symposium on IT in Medicine &
Education, 2009. 1.
9. Naksomboon, S., Charnsripinyo, C. and Wattanapongsakorn, N., Considering Behavior of Sender
in Spam Mail Detection, International Conference on Networked Computing (INC 2010), 2010.
Gyeongju, South Korea.
10. Han, E. H. S. and Karypis, G., Centroid-based document classification, Analysis and experimental
results, 2000, Springer Berlin Heidelberg. p. 424-431.