
Text classification with machine learning algorithms

March 2013

3(1):31-35


J. Basic. Appl. Sci. Res., 3(1s)31-35, 2013

ISSN 2090-4304

Journal of Basic and Applied

Scientific Research

www.textroad.com

Corresponding author: Nasim VasfiSisi, Department of Computer, Shabestar Branch, Islamic Azad University, Shabestar, Iran. Email address: Nasim_vasfi@yahoo.com

Text Classification with Machine Learning Algorithms

Nasim VasfiSisi1 and Mohammad Reza Feizi Derakhshi2

1Department of Computer, Shabestar Branch, Islamic Azad University, Shabestar, Iran

2Department of Computer, University of Tabriz, Tabriz, Iran

Received: June 10 2013

Accepted: July 7 2013

ABSTRACT

With growing access to electronic documents and the rapid growth of the World Wide Web, automatic document classification has become a key method for organizing information and for knowledge discovery. Properly classifying electronic documents, online news, weblogs, emails and digital libraries requires text mining, machine learning and natural language processing techniques in order to obtain meaningful knowledge. The aim of this paper is to highlight the major techniques and methods applied in document classification by reviewing some existing text classification methods.

KEYWORDS: Text mining, text classification, machine learning algorithms, classifiers.

1. INTRODUCTION

In recent years, the volume of text documents on the internet, in news sources and on company intranets has grown dramatically, and these documents need to be classified. Automatic text classification assigns text documents to predefined classes, which helps both to organize these large resources and to find information in them. It has several applications, including automatic indexing of scientific articles against a predefined vocabulary of terms, filing patents into patent directories, spam filtering, identifying document types, automatic grading of essays and authorship attribution, as well as organizing electronic government repositories, news articles, biological databases, chat rooms, online forums, electronic mail and weblog collections [2].

Automatic classification frees organizations from manual classification, which can be expensive and time-consuming. The accuracy of modern text classification systems now rivals that of trained professionals, thanks to a combination of information retrieval and machine learning technologies [2,3].

Today, text classification remains challenging because datasets contain very many features, very many training samples and interdependent features, which has led to the development of many different text classification algorithms [10].

In text classification, each document may be assigned to no class, to exactly one class or to several classes. The main goal of using machine learning methods is for the classifier to learn automatically from samples that have already been assigned to classes [1]. For example, each incoming news item can be labeled automatically with a subject such as “sport”, “politics” or “art”. Classification starts from a training set d = (d1, …, dn) of documents labeled with classes c1, c2, …, cn (such as sport, politics, etc.). From this set a classification model is built that can determine the suitable class for a new document d from the same domain. A document may carry one label or multiple labels: a single-label document belongs to only one class, and a multi-label document belongs to more than one class [4].

The rest of this paper is organized as follows: section two describes the document pre-processing steps, section three presents different types of text classification methods, and section four concludes.

2. Pre-processing document

The first step in text classification is to transform documents, which are typically strings of characters in various formats, into a representation suitable for learning and classification algorithms. As in information retrieval, it is generally preferable to reduce each word to its root (stem) so that the stem can be treated as a single unit; each distinct stem then becomes one feature of the text. The value of a feature equals the number of times the word occurs in the document. To avoid unnecessarily large feature vectors, a word is used as a feature only if it occurs at least three times in the training data and is not a stop word [1]. Fig. 1 represents the text classification process:

VasfiSisi and Derakhshi 2013

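Fig 1. The text classification process [1]: Read Document → Tokenize Text → Stemming → Delete Stopwords → Vector Representation of Text → Feature Selection and/or Feature Transformation → Learning Algorithm.

The steps in Fig. 1 are briefly described below: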

a) Read Document: first, all of the documents are read.

b) Tokenize Text: the text is broken into tokens (meaningful words, terms, phrases, symbols or other elements); this process is called tokenization.

c) Stemming: each word is reduced to its root form.

d) Delete Stopwords: words such as "in", "this", "a", "an", "the" and "with" are removed.

e) Vector Representation of Text: an algebraic model is defined to represent text documents as vectors. Because these vectors are usually high-dimensional, the main goal of the feature selection methods in the next step is to reduce dimensionality.

f) Feature Selection and/or Feature Transformation: the dimensionality of the dataset is reduced using feature selection methods that remove features unrelated to classification. After feature selection, depending on the requirements, machine learning algorithms such as Genetic Algorithms, Neural Networks, Rule Induction, Fuzzy Decision Trees, SVM, K-NN (K-Nearest Neighbor), LSA, the Rocchio algorithm and Naïve Bayes can be applied [1].
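As an illustration only, steps (b) to (e) and the minimum-frequency rule described above can be sketched in Python as follows. The short stop-word list, the crude suffix-stripping stemmer and all function names are assumptions made for this sketch; a practical system would use a full stop-word list and an established stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"in", "this", "a", "an", "the", "with", "and", "of", "to", "is"}

def tokenize(text):
    """Break raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least four characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, stem, then delete stop words."""
    stems = [stem(t) for t in tokenize(text)]
    return [s for s in stems if s not in STOP_WORDS]

def build_vocabulary(documents, min_count=3):
    """Keep terms that occur at least min_count times in the training data."""
    totals = Counter()
    for doc in documents:
        totals.update(preprocess(doc))
    return sorted(term for term, n in totals.items() if n >= min_count)

def vectorize(text, vocabulary):
    """Represent a document as a vector of term counts over the vocabulary."""
    counts = Counter(preprocess(text))
    return [counts[term] for term in vocabulary]

training_docs = [
    "The team played a great match",
    "Players are playing the match with the team",
    "The match ended and the team played well",
]
vocabulary = build_vocabulary(training_docs)
vector = vectorize("The team is playing; the team played", vocabulary)
```

Here each position of `vector` counts one vocabulary term, so rare terms and stop words never become features.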

Machine learning, natural language processing (NLP) and data mining techniques are used together to classify electronic documents automatically and to discover patterns in them. The main goal of text mining is to allow users to extract information from text resources and to carry out operations such as retrieval, classification (supervised, unsupervised and semi-supervised) and summarization [3].

Advances in computer hardware now provide enough computing power for text classification to be used in practical applications. Text classification is commonly used to handle spam emails, to classify large text collections into subject categories, to support knowledge management and to assist internet search engines [6].

3. Classifiers

3.1 SVM algorithm

The standard SVM (Support Vector Machine) was proposed by Cortes and Vapnik in 1995 [8]. SVM is a supervised learning method used for classification and regression. The SVM classification method comes from statistical learning theory and is based on the Structural Risk Minimization principle. The idea of this principle is to find a hypothesis that guarantees the lowest true error. SVM needs both a positive and a negative training set, which is uncommon for other classification methods. These sets are needed so that the SVM can search for the decision surface, called the hyperplane, that best separates the positive from the negative data in the n-dimensional space. An SVM therefore constructs a hyperplane, or a set of hyperplanes, in a high-dimensional space [2,3].

In general, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of both classes (this distance is called the margin), since a larger margin generally gives a lower classification error [8]. The SVM method also tries to reduce the number of wrongly classified points; this goal is commonly formulated as equation (1) [2]:


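$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{l}\xi_i \qquad \text{s.t.}\quad y_i\,(w\cdot x_i + b)\ \ge\ 1-\xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,l \tag{1}$$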
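To make the margin idea concrete, the following is a minimal sketch of a linear soft-margin SVM trained by sub-gradient descent on the hinge loss. It assumes the usual objective 0.5·‖w‖² + C·Σ max(0, 1 − y_i(w·x_i + b)) with labels +1 and −1. The training procedure, the learning rate, the toy data and all names are assumptions chosen for illustration; practical SVM solvers use dedicated quadratic-programming or coordinate methods instead.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5*||w||^2 plus C times the total hinge loss (margin violations)."""
    regularizer = 0.5 * dot(w, w)
    slack = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y))
    return regularizer + C * slack

def train_linear_svm(X, y, C=1.0, learning_rate=0.01, epochs=200):
    """Sub-gradient descent on the soft-margin objective (illustrative only)."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * (dot(w, xi) + b) < 1.0
            # The regularizer's gradient is spread evenly over the n samples.
            grad_w = [wj / n for wj in w]
            grad_b = 0.0
            if violated:
                # A point inside the margin pulls the hyperplane toward its side.
                grad_w = [g - C * yi * xij for g, xij in zip(grad_w, xi)]
                grad_b = -C * yi
            w = [wj - learning_rate * g for wj, g in zip(w, grad_w)]
            b -= learning_rate * grad_b
    return w, b

def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy two-dimensional data: positive class to the right, negative class above.
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

With all weights at zero every point violates the margin, so the objective equals C times the number of samples; training lowers it by moving the hyperplane away from both classes.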

3.2 Neural network algorithm

A neural network classifier is a network of units in which the input units usually represent words and the output unit(s) represent a class or class label. To classify a test document, its word weights are loaded into the input units, the activation of these units is propagated forward through the network, and the resulting value of the output unit(s) determines the class decision. Some researchers use the single-layer perceptron because it is simple to implement, whereas the more complex multi-layer perceptron requires a more extensive implementation. Using an effective feature selection method to reduce dimensionality improves the efficiency of this approach. Recently proposed neural-network-based document classification methods are useful for document management in companies [4].
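The single-layer perceptron mentioned above can be sketched as follows, with one input unit per word and one output unit. The four-word vocabulary, the class encoding (1 for sport, 0 for politics), the toy counts and the function names are assumptions made for this illustration.

```python
def perceptron_predict(weights, bias, x):
    """Forward propagation: weighted sum of the inputs, then a threshold."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0

def train_perceptron(X, y, learning_rate=1.0, epochs=20):
    """Classic perceptron rule: adjust weights only on misclassified examples."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            error = target - perceptron_predict(weights, bias, xi)
            if error != 0:
                weights = [w + learning_rate * error * v for w, v in zip(weights, xi)]
                bias += learning_rate * error
                errors += 1
        if errors == 0:  # every training example is classified correctly
            break
    return weights, bias

# Each vector holds counts of the words ["goal", "match", "vote", "election"].
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y = [1, 1, 0, 0]  # 1 = sport, 0 = politics
weights, bias = train_perceptron(X, y)
```

On this linearly separable toy set the rule converges after a few updates; a multi-layer perceptron would add hidden units between the input and output layers.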

3.3 K-NN (K-Nearest Neighbor) algorithm

K-NN is a case-based (instance-based) learning method and one of the simplest machine learning algorithms. An example is classified by a majority vote of its neighbors: it is assigned to the most common class among its k nearest neighbors. K is a positive integer and is typically small. If k=1, the example is simply assigned to the class of its nearest neighbor. Choosing an odd k is useful in two-class problems because it prevents tied votes [5]. K-NN is widely applied because it is effective, non-parametric and simple to implement, but its classification time is long and the optimal value of k is difficult to find. The best choice of k depends on the data: in general, larger values of k reduce the effect of noise on classification but make the boundaries between classes less distinct [4]. Fig. 2 is an example of the K-NN classification algorithm [7]:

Fig 2. Example of K-NN classification algorithm [7].

Fig. 2 shows an example of K-NN classification in a multi-dimensional feature space, where triangles represent the first class and squares the second class. The small circle is the test example. If k=3, the test example is assigned to the triangle class; if k=5, it is assigned to the square class [5].

The algorithm works as follows. The training examples are represented as vectors in a multi-dimensional feature space, and the space is partitioned into regions by the locations and labels of the training examples. A point in the space is assigned to the class that is most frequent among its k nearest training examples; Euclidean distance or cosine similarity is usually used to measure closeness. In the classification phase, a test example is represented as a vector in the feature space, its Euclidean distance or cosine similarity to all training vectors is computed, and the k nearest training examples are selected. There are several ways to assign a class to the test vector from these neighbors; the classic K-NN algorithm assigns the class that receives the most votes among the k nearest neighbors. The three main factors in the K-NN algorithm are [7]:

1. The distance or similarity measure used to find the k nearest neighbors.

2. K, the number of nearest neighbors.

3. The decision rule used to derive a class for the test document from its k nearest neighbors.
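The three factors above map directly onto a short implementation. The sketch below uses Euclidean distance by default, takes k as a parameter and applies the majority-vote decision rule; the coordinates are invented so that the outcome changes between k=3 and k=5, in the spirit of Fig. 2, and all names are assumptions made for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus cosine similarity; defined as 1.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def knn_classify(training, query, k, distance=euclidean):
    """Majority vote among the k training examples closest to the query."""
    neighbors = sorted(training, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Invented points: two triangles lie closest to the query, three squares follow.
training = [
    ((1.0, 0.0), "triangle"),
    ((0.0, 1.2), "triangle"),
    ((-1.5, 0.0), "square"),
    ((0.0, -2.0), "square"),
    ((2.5, 0.0), "square"),
    ((4.0, 4.0), "triangle"),
]
query = (0.0, 0.0)
label_k3 = knn_classify(training, query, k=3)
label_k5 = knn_classify(training, query, k=5)
```

Every query is compared with all training vectors, which is why classification time grows with the size of the training set.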


3.4 Decision Tree

A decision tree is a classification algorithm whose structure is based on “if-then” classification rules. To build it, we first determine the possible outcomes and then grow the tree from the root node. Each node is labeled with a value obtained from a gain function [9].

In a decision tree, leaves represent document classes and branches represent the conjunctions of features that lead to those classes. A well-built decision tree classifies a document by starting at the root node and following the query structure down the tree until a leaf is reached, which gives the class of the document. Fig. 3 shows an example of a decision tree [3].

Fig 3. An example of a Decision Tree [3].

The decision tree classification method has clear advantages over other decision support tools. Its main advantage is that it is simple to understand and interpret, even for non-expert users. Furthermore, the results obtained can easily be explained using simple mathematical algorithms. Experiments show that text classification tasks often involve a large number of relevant features. One application of decision trees is personalizing advertisements on web pages. A major risk in building a decision tree is overfitting the training data: there may exist an alternative tree that categorizes the training data worse but would categorize new documents better [3].
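As a hedged illustration of the gain function used to choose a node, the sketch below computes entropy-based information gain for a split on whether a word occurs in a document, and then picks the word with the highest gain as the root test. Information gain is one common choice of gain function and is assumed here; the toy documents and all names are likewise assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(documents, labels, term):
    """Entropy reduction from splitting on whether `term` occurs in a document."""
    present = [lab for doc, lab in zip(documents, labels) if term in doc]
    absent = [lab for doc, lab in zip(documents, labels) if term not in doc]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder

def best_split_term(documents, labels):
    """Pick the term with the highest gain (first in alphabetical order on ties)."""
    vocabulary = sorted(set().union(*documents))
    return max(vocabulary, key=lambda term: information_gain(documents, labels, term))

# Each document is the set of words it contains.
documents = [
    {"goal", "match"},
    {"goal", "team"},
    {"vote", "election"},
    {"vote", "team"},
]
labels = ["sport", "sport", "politics", "politics"]
root_term = best_split_term(documents, labels)
```

The chosen term yields an if-then rule at the root (if the term occurs, follow one branch, otherwise the other), and the same procedure would be repeated on each branch to grow the tree.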

3.5 Bayesian classification

The naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model”. The independence assumption makes the order of features irrelevant, so that the presence of one feature does not affect other features in the classification. These assumptions make the computation of the Bayesian classifier efficient, but they also limit its applicability considerably. Depending on the precise nature of the probability model, a Bayesian classifier can be trained very efficiently and needs a relatively small amount of training data to estimate the parameters necessary for classification. Because the variables are assumed independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix [3].
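The independence assumption means that a document's score for a class is simply the class prior multiplied by one factor per word. The sketch below illustrates this with the multinomial variant and add-one (Laplace) smoothing, a common choice for word counts that is assumed here in place of the per-class variance estimates mentioned above; the toy spam data and all names are also assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary

def classify(model, tokens):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
            likelihood = (term_counts[label][token] + 1) / (total_terms + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

documents = [
    ["cheap", "pills", "offer"],
    ["offer", "cheap", "cheap"],
    ["meeting", "agenda", "report"],
    ["report", "meeting", "offer"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(documents, labels)
```

Training only requires counting, which is why few training examples are needed and why the order of words in a document has no effect on the result.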

4. Conclusion

Various algorithms, as well as hybrid combinations of algorithms, have been proposed for the automatic classification of documents. Bayesian classification works well for spam filtering and text classification and requires only a small amount of training data to estimate the parameters needed for classification. It performs well on both text and numerical data and is easier to implement than the other algorithms.

However, its conditional independence assumption is violated by real-world data, so it performs poorly when features are highly correlated, and it does not take the frequency of word occurrences into account. In short, the advantage of Bayesian classification is that it needs little training data, and its main disadvantage is its relatively low classification performance compared with other discriminative algorithms.

SVM is recognized as one of the most effective text classification methods among supervised machine learning algorithms. It achieves high precision, but at the cost of lower recall. SVM captures the inherent characteristics of the data and embodies the Structural Risk Minimization (SRM) principle, which minimizes an upper bound on the generalization error; moreover, its ability to learn can be independent of the dimensionality of the feature space. The K-NN algorithm performs well when documents are characterized by highly local features, but its classification time is long and the optimal value of k is difficult to find. The major advantage of the decision tree is that it is simple to understand.


REFERENCES

1. Bhavani Dasari, D. and Gopala Rao, K. V., Text Categorization and Machine Learning Methods, Current State of the Art, Global Journal of Computer Science and Technology Software & Data Engineering, 2012. 12(11).

2. Liu, X. and Fu, H., A Hybrid Algorithm for Text Classification Problem, Przegląd Elektrotechniczny (Electrical Review), 2011.

3. Khan, A., Baharudin, B., Hong Lee, L. and Khan, Kh., A Review of Machine Learning Algorithms for Text-Documents Classification, Journal of Advances in Information Technology, 2010. 1(1).

4. Korde, V. and Mahender, C. N., Text Classification and Classifiers, A Survey, International Journal of Artificial Intelligence & Applications (IJAIA), 2012. 3(2).

5. Ananthi, S. and Sathyabama, S., Spam Filtering Using K-NN, Journal of Computer Applications, 2009. 2(3).

6. Mahinovs, A. and Tiwari, A., Text Classification Method Review, Decision Engineering Report Series, 2007.

7. Miah, M., Improved k-NN Algorithm for Text Classification, In Proceedings of DMIN, 2009. p. 434-440.

8. Xiao.li, CH., Pei.yu, L., Zhen.fang, Z. and Ye, Q., A Method of Spam Filtering Based on Weighted Support Vector Machines, IEEE International Symposium on IT in Medicine & Education, 2009. 1.

9. Naksomboon, S., Charnsripinyo, C. and Wattanapongsakorn, N., Considering Behavior of Sender in Spam Mail Detection, International Conference on Networked Computing (INC 2010), 2010. Gyeongju, South Korea.

10. Han, E. H. S. and Karypis, G., Centroid-Based Document Classification, Analysis and Experimental Results, 2000, Springer Berlin Heidelberg. p. 424-431.
