Article
Corresponding author:
Deniz Kılınç, Department of Software Engineering, Faculty of Technology, Celal Bayar University, Manisa, Turkey.
Email address: drdenizkilinc@gmail.com
Journal of Information Science
1–13
© The Author(s) 2015
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0165551510000000
jis.sagepub.com
TTC-3600: A new benchmark dataset
for Turkish text categorization
Deniz Kılınç
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Akın Özçift
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Fatma Bozyigit
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Pelin Yıldırım
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Fatih Yücalar
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Emin Borandag
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Abstract
Due to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet increases explosively with each passing day. Considering news portals in particular, documents related to categories such as technology, sports and politics sometimes appear in the wrong category, or are placed in a generic category called others. At this point, text categorization (TC), which is generally addressed as a supervised learning task, is needed. Although there is a substantial number of studies on TC in other languages, the number of studies conducted on Turkish is very limited, owing to the lack of accessibility and usability of the datasets created. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC on Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy (ACC) value, 91.03%, is obtained with the combination of the Random Forest (RF) classifier and the attribute ranking-based feature selection method, among all comparisons performed after the pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.
Keywords
Text classification; Turkish text categorization; feature selection; TTC-3600 dataset
1. Introduction
The rapid growth of the World Wide Web and of Internet use leads to a rapid increase in the amount of unstructured
data on the Internet with each passing day. According to the International Data Corporation (IDC), the amount of
unstructured data on the Internet will reach 40 zettabytes by 2020, 50 times the amount of unstructured data that existed on the Internet in 2010¹. Manual categorization of this unstructured data is practically impossible, so a continuous automatic categorization process is needed to keep the data manageable and accessible. Considering news portals in particular, documents related to categories such as technology, sports, politics and health sometimes appear in the wrong category, or are placed in a generic category called others. At this point, approaches and methods from the field of Text Mining (TM) [1], which is an
important research area, are needed. The purpose of TM, which is also known as Intelligent Text Analysis, Knowledge
Discovery in Text and Text Data Mining in the literature, is extracting valuable and significant information and
knowledge from unstructured text documents [2]. TM is an interdisciplinary field that can use machine learning [3],
computational linguistics, information retrieval and statistics compositely. One of the most widely utilized methods in
TM studies is Text Categorization/Classification (TC), which falls within the supervised-learning category of machine learning. TC builds a model from a pre-defined set of labeled data and aims to assign uncategorized data to the correct category [4]. In other words, it evaluates uncategorized data based on its content and categorizes it.
One of the most important characteristics of TC is its high dimensionality, in which thousands of features can be generated [5]. Most of the features are irrelevant and degrade the performance of the classifier. Hence, dimensionality reduction, which removes redundant and irrelevant features from the dataset before machine learning algorithms are evaluated, is a critical step in TC. Feature selection is the most widely used dimensionality reduction technique; it selects a relevant subset of the entire feature set [6].
In this study, a new dataset called TTC-3600, which can be widely used in the studies of TC regarding Turkish news
and articles, is created and comprehensive experimental studies are performed on this dataset. Considering the literature,
although there are a substantial number of studies conducted on TC in other languages, the number of studies conducted
in Turkish is very limited. All TC studies available in Turkish in the literature are investigated within the scope of this paper. Since the datasets used in other studies are either not available or were created for different purposes, the dataset used in this study consists of news collected from six news portals and agencies that are very well known in Turkey, and it has been made publicly available so that other researchers can use it in their experimental work⁹.
Three different versions of TTC-3600, which are subjected to stemming, are also created and utilized in order to
observe the effect of pre-processing on Turkish TC. In the machine learning domain, various types of TC algorithms exist, such as lazy learning, statistical learning and decision tree induction. Among these, selecting the single best-performing one is a challenging task, as indicated by the No Free Lunch (NFL) theorem [7]. Based on this theorem, five well-known
classifiers Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), Decision Tree (J48) and
Random Forest (RF) in the field of TC are evaluated on all versions of TTC-3600 dataset.
In addition to these experimental studies, impacts of dimensionality reduction methods on Turkish TC are also
observed during experimental studies. Correlation-based feature selection (CFS) and attribute ranking-based (ARFS)
feature selection methods are employed in order to evaluate the results of dimensionality reduction technique. The
experimental results show that RF classifier is more accurate in all stemming steps (F5, F7 and Zemberek) and feature
selection methods applied on TTC-3600 dataset and the best ACC result is obtained after applying ARFS on Zemb-DS
dataset.
The rest of the paper is organized as follows: the second section offers a comprehensive literature review on TC. In the third section, the materials and methods utilized are introduced briefly. Section four presents the experimental study and discusses the experimental results obtained. Finally, the fifth section concludes the paper with some future directions.
2. Related Works
Considering the previous studies in the literature, although there are many studies on TC in other languages, the number of TC studies conducted in Turkish is very limited. For instance, there are many TC studies conducted in English, one of the most widely spoken languages in the world [8-10]. In addition, there are interesting studies in the literature performed in other languages, such as Arabic, which has different morphological properties. The study of Hmeidi et al. [11] aims to assign articles written in Arabic to the relevant categories. Five well-known algorithms in the field of TC are discussed and the success rates achieved by these algorithms are compared with each other. Another study on Arabic TC, proposed by Shaalan and Qudash [12], combines different machine learning algorithms in order to perform named entity recognition. It is claimed that the success rate of the study exceeds 90% and that it gives accurate results. Al-Radaideh et al. [13] conducted a study to detect spam emails composed in Arabic. They claim that they obtained accurate results for 87% of the messages in the dataset by using the Graham statistical filter and a rule-based filter.
The aim of this study is to investigate Turkish text categorization. In a study conducted by Güran et al. [14], the NB, Multinomial Naïve Bayes (MNB), J48 and K-NN TC algorithms are evaluated. Their study is based on the N-gram algorithm, and their experiments are carried out on documents both with and without pre-processing. According to the experimental results, the worst results are obtained when bi-gram and tri-gram representations are used together with the K-NN algorithm, while the J48 classifier gives the best classification results in general.
In another study, conducted by Torunoglu et al. [15], the importance of pre-processing steps in Turkish TC is examined. Different pre-processing methods and four TC algorithms (NB, MNB, SVM and K-NN) are evaluated. Considering the experimental results, it is concluded that pre-processing did not have the expected impact on Turkish TC.
Akkuş and Cakıcı [16] suggested that morphological analysis would be a useful method for TC in semantically rich languages such as Turkish, and studied its contribution to Turkish TC. First, the stems of the words are identified using the Fixed Length Stemmer method, and then the K-NN, SVM and NB learning algorithms are evaluated on these stems. According to the evaluation results on their dataset, a simple approximation that represents documents with the first five characters of each word gives similar or better results than an expensive morphological analysis, at much lower cost.
In the study of Amasyalı and Beken [17], a different approach to TC is presented. They map the words of a text document into a semantic space that they have created, and indicate that representing words in this semantic space gives better results than the bag-of-words model. According to their experimental results, the Linear Regression classification algorithm gives the most successful results.
Amasyalı and Diri [18] proposed an n-gram approach to Turkish TC. They evaluated the NB, SVM, J48 and Random Forest classification algorithms, and suggested that classifiers trained with bi-grams give better results than those trained with tri-grams. Considering the results of the classification algorithms, NB gives more successful results in determining the author of a text, whereas SVM gives more accurate results in determining the genre of the text and the gender of the author.
Tüfekçi and Uzun [19] investigated the effect of different term weighting methods on identifying the author of a text. After the stems of the words are identified, different feature vectors are determined for each document by trying different weighting methods. The MNB, SVM, Decision Tree and Random Forest classification algorithms are applied to the vectors created and the results are compared with each other. According to the experimental results, the best results are obtained with the SVM algorithm.
In the study of Çataltepe et al. [20], the effect of the length of stems derived from words on Turkish TC is studied. They obtain short stems from long stems using various methods, regardless of the meaning of the words. Their aim is to compare the accuracy rates obtained by classifying tf-idf-weighted vectors built from stems containing fewer characters. As a result, it is observed that the Centroid classification method conducted with shortened stems gives better results.
In a study conducted by Alparslan et al. [21], the aim is to perform information extraction from classified Turkish-language documents. First, word stems are extracted using stemming algorithms designed specifically for Turkish text documents. Document-term matrices are then formed by applying the tf-idf weighting method to the stems obtained after pre-processing. Unlike other studies, the SVM and Adaptive Neuro-Fuzzy classification algorithms are combined in this study, and the experimental results suggest that the proposed method is more accurate.
In the study of Uysal and Gunal [22], it is shown that pre-processing is important for TC. Emails and news written in both English and Turkish are used as the dataset, and the ways in which pre-processing methods affect the classification of text documents are determined. They examined how tokenization, stop-word removal, lowercase conversion and stemming, and their various combinations, affect the accuracy rate of the SVM classification algorithm. As a result, it is seen that some pre-processing methods reduce the classification accuracy of text documents, while lowercase conversion and stop-word removal improve it.
Gunal [23] studied the effect of different feature selection approaches on TC. A hybrid selection method is proposed by combining filter and wrapper feature selection methods. According to these studies, the features obtained by this method give better results in Turkish TC than those obtained by a single selection method.
There are also other Turkish text analysis and text retrieval studies besides those on Turkish TC. Özalp et al. [24] conducted studies to detect slang words in news and in comments made on articles and columns on the Internet. They proposed a system that can automatically filter comments made on online articles, magazines and news texts. Unlike the most widely used classification approaches in the literature, they proposed an irregularity-based approach, which is suggested to be advantageous in terms of memory management and low computational complexity.
In the study of Özgür et al. [25], an anti-spam filtering method developed for Turkish in particular, and for agglutinative languages in general, is proposed. The study consists of two separate modules: a Learning Module and a Morphology Module. Both Artificial Neural Network and Bayesian Network algorithms are used, and the authors claim a success rate of 90% in finding Turkish spam emails on their dataset.
Can et al. [26] formulated hypotheses about the factors that can affect the performance of text retrieval and tested the validity of these hypotheses one by one. First, they considered the hypothesis that creating a stop-word list and removing these stop-words would affect retrieval performance; however, according to the tests conducted, this process does not have a significant impact on text retrieval.
Kılıçaslan et al. [27] studied anaphora resolution in Turkish texts. They compared different methods for identifying pronouns in Turkish texts and evaluated the success of different machine learning algorithms used to analyze Turkish text documents. Considering the success rates of anaphora resolution, the learning models are suggested to be more successful than the baselines.
3. Materials and methods applied
3.1. Turkish language overview
Turkish belongs to the Altaic branch of the Ural-Altaic family of languages. The distinctive characteristics of Turkish are vowel harmony and extensive agglutination, which refers to the process of adding suffixes to a stem. It is possible to express the meaning of an entire English sentence with a single word in Turkish. For example, the English sentence "We were not sleeping" is a single word in Turkish: "sleep" is the stem, and elements meaning "not," "-ing," "we," and "were" are all suffixed to it: "Uyumuyorduk". The Turkish alphabet is derived from the Latin alphabet and consists of 8 vowels (a, e, ı, i, o, ö, u, ü) and 21 consonants (b, c, ç, d, f, g, ğ, h, j, k, l, m, n, p, r, s, ş, t, v, y, z); 7 of these letters (ç, ı, ş, ö, ü, ğ, İ) are modified from their originals in the Latin alphabet.
3.2. Pre-processing
Pre-processing is one of the most important steps in preparing a text dataset for TC. Tokenization, stop-word elimination and stemming are the most widely used pre-processing methods. In general, removal-based pre-processing is conducted first: all common separators, operators, punctuation marks and non-printable characters are removed. Then, stop-word filtering, which aims to filter out the most frequent words, is performed.
Finally, stemming is applied to obtain the stem of a word, i.e. its morphological root, by removing the suffixes that carry grammatical or lexical information about the word. The stemming process is based on the hypothesis that "words with the same stem belong to relatively similar concepts". Since Turkish is an agglutinative language and thousands of different words can be derived from a root word, stemming is an important step before performing text categorization. In the present study, the fixed prefix stemming (FPS) [26] approach and a dictionary-based Turkish stemmer called Zemberek [29] are used. FPS is a pseudo-stemming method: it takes the first n characters of a word and accepts them as the stem. Zemberek is a general-purpose open-source NLP toolkit and includes a suffix dictionary created for stemming.
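The fixed prefix stemming step is simple enough to sketch directly. The snippet below is a minimal illustration of FPS with n = 5 (F5) and n = 7 (F7); the sample tokens are invented for illustration and are not taken from TTC-3600. Zemberek itself is an external toolkit and is not reimplemented here.

```python
# Minimal sketch of fixed prefix stemming (FPS): each token is truncated
# to its first n characters, which serve as the pseudo-stem.

def fps_stem(token, n):
    """Return the first n characters of a token as its pseudo-stem."""
    return token[:n]

def fps_stem_document(tokens, n):
    """Apply fixed prefix stemming to a list of tokens."""
    return [fps_stem(t, n) for t in tokens]

tokens = ["uyumuyorduk", "uyudu", "teknolojik", "spor"]
print(fps_stem_document(tokens, 5))  # ['uyumu', 'uyudu', 'tekno', 'spor']
```

Note that FPS conflates unrelated words that merely share a prefix; the experiments later in the paper measure how much this cheap approximation costs relative to dictionary-based stemming.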
3.3. Feature representation and weighting
Machine learning classifiers generally handle text documents as bag of words (BoW). Vector Space Model (VSM) is an
improved version of BoW, where each text document is represented as a vector, and each dimension corresponds to a
separate term (word) [28]. If a term occurs in the document, then its value becomes non-zero in the vector. When it is
considered from TC perspective, the goal is to construct vectors containing features per category by using a training set
of the documents. In VSM, term weighting is a critical step and three major parts that affect the importance of a term in
a text exist, as follows: the term frequency factor (tf), the inverse document frequency factor (idf) and document length normalization. The normalized weight of a term is computed as illustrated in Equation 1.

w_{ij} = \frac{tfidf(t_i, d_j)}{\sqrt{\sum_{k} tfidf(t_k, d_j)^2}}   (1)

where each tfidf(t_i, d_j) equals tf_{ij} \times idf_i, as in Equation 2.

tfidf(t_i, d_j) = tf_{ij} \times \log\left(\frac{N}{df_i}\right)   (2)

where t_i is the ith term in the document d_j, tf_{ij} is the frequency of word t_i in document d_j, idf_i is the inverse document frequency of word t_i in the dataset, df_i is the number of documents containing the word t_i, and N is the total number of documents in the dataset.
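Equations 1 and 2 can be illustrated with a short sketch. The toy corpus below is invented for the example; the function computes raw tf × idf weights (Equation 2) over a shared vocabulary and then applies document length normalization (Equation 1).

```python
import math

def tfidf_vector(doc_tokens, corpus):
    """Compute the length-normalized tf-idf vector of one document
    (Equations 1 and 2), over the vocabulary of the whole corpus."""
    N = len(corpus)
    vocab = sorted({t for d in corpus for t in d})
    # df_i: number of documents containing term t_i
    df = {t: sum(1 for d in corpus if t in d) for t in vocab}
    # tf_ij * log(N / df_i) for each vocabulary term (Equation 2)
    raw = [doc_tokens.count(t) * math.log(N / df[t]) for t in vocab]
    # document length normalization (Equation 1)
    norm = math.sqrt(sum(w * w for w in raw)) or 1.0
    return [w / norm for w in raw]

corpus = [["spor", "mac"], ["spor", "gol"], ["ekonomi", "borsa"]]
print(tfidf_vector(corpus[0], corpus))
```

In this toy corpus "spor" occurs in two of the three documents, so its idf (and hence its weight) is lower than that of the rarer term "mac", while the normalized vector has unit length.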
3.4. Text categorization and selected classifiers
As a general description, the aim of TC is to classify uncategorized documents into predefined categories. From a machine learning perspective, the aim of TC is to learn classifiers from labeled documents and to perform classification on unlabeled documents. In the literature, there is a rich collection of machine learning classifiers for TC [4]. The selection of the best-performing classifier depends on different parameters such as the number of training examples, the dimensionality of the feature space, feature independence, over-fitting, simplicity and the system's requirements. Considering the high dimensionality and over-fitting characteristics, and the related research conducted on TC, five well-known TC classifiers (NB, SVM, K-NN, J48, and RF) are selected. Detailed information about each selected classifier is given in the following subsections.
3.4.1. Naïve Bayes
Naïve Bayes (NB) classifier is a well-known statistical supervised learning algorithm based on Bayes' Theorem [30].
Conditional probabilities are calculated over the training set to determine the category into which a text document should be classified. Easy implementation and high performance are important advantages of the NB classifier. Furthermore, it requires only a small amount of training data to estimate its parameters, and good results are obtained in most cases. Its main disadvantage is that dependencies between features cannot be modeled. NB is frequently applied in the areas of medical diagnosis, TC, pattern recognition and target marketing, where it gives quite successful results. The basic equation of the NB classifier is illustrated in Equation 3.

P(c_j \mid d) = \frac{P(d \mid c_j)\, P(c_j)}{P(d)}   (3)

where P(c_j \mid d) is the probability of instance d being in class c_j, P(d \mid c_j) is the probability of generating instance d given class c_j, P(c_j) is the probability of occurrence of class c_j, and P(d) is the probability of instance d occurring.
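Equation 3 can be illustrated with a toy example. The priors and conditional word probabilities below are invented numbers, not estimates from TTC-3600; since P(d) is identical for every class, the sketch compares only the unnormalized numerators.

```python
# Toy illustration of Equation 3 with two hypothetical categories.

def naive_bayes_score(doc_tokens, prior, cond_prob):
    """Return the unnormalized P(d|c) * P(c); the denominator P(d) is
    the same for every class, so it can be dropped when comparing."""
    score = prior
    for t in doc_tokens:
        score *= cond_prob.get(t, 1e-6)  # tiny floor for unseen words
    return score

priors = {"spor": 0.5, "ekonomi": 0.5}
cond = {
    "spor":    {"gol": 0.4, "mac": 0.4, "borsa": 0.01},
    "ekonomi": {"gol": 0.01, "mac": 0.05, "borsa": 0.5},
}
doc = ["gol", "mac"]
scores = {c: naive_bayes_score(doc, priors[c], cond[c]) for c in priors}
print(max(scores, key=scores.get))  # 'spor'
```

The multiplication of per-word probabilities is exactly where the independence assumption (the "naïve" part) enters: dependencies between features are ignored.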
3.4.2. Support Vector Machine
Support vector machine (SVM), introduced in 1992, is a classifier based on statistical learning theory and structural risk minimization. The SVM algorithm comes in two forms: linear and nonlinear SVM. In linear SVM, among the infinitely many hyper-planes that can separate the data, the maximum-margin hyper-plane is selected. Nonlinear SVM is used when the classes are not linearly separable; the data is transferred into a higher-dimensional space, in which it becomes linearly separable [31]. The main advantages of SVM are high accuracy and robustness against over-fitting, achieved via structural risk minimization using a regularization parameter. The SVM classifier can also work well with an appropriate kernel even if the data is not linearly separable in the base feature space. Memory-intensive training, difficult interpretation, and the need to determine the regularization and kernel parameters and to choose a kernel are the disadvantages of SVM². The main application areas of the SVM classifier are TC, pattern recognition, bioinformatics and hand-written character recognition.
3.4.3. K-Nearest Neighbor
K-Nearest Neighbor (K-NN), which has no training phase, is an instance-based lazy learning classification algorithm [32]. In this algorithm, the document to be categorized is assigned a category by considering the k closest neighbors among the documents that have known class labels. In K-NN, closeness is defined by a similarity measure such as the Euclidean distance. Equation 4 calculates the Euclidean distance between two instances x_i and x_j.

d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} (x_{ir} - x_{jr})^2}   (4)
The implementation of K-NN is simple, the cost of the learning phase is zero, and it is robust to noisy data. Despite these advantages, K-NN also has some drawbacks. Since no description of the learned concepts exists, a K-NN model cannot be interpreted, and determining the value of the parameter k, which specifies the number of nearest neighbors, is not easy. Finally, finding the k nearest neighbors in high dimensions is computationally expensive.
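The K-NN procedure described above, Euclidean distance (Equation 4) plus a majority vote over the k closest labeled instances, can be sketched as follows; the two-dimensional training vectors and labels are invented for illustration.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Equation 4: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train, query, k=1):
    """Label the query by a majority vote over its k nearest neighbors.
    `train` is a list of (vector, label) pairs."""
    neighbors = sorted(train, key=lambda vl: euclidean(vl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([1.0, 0.0], "spor"), ([0.9, 0.1], "spor"), ([0.0, 1.0], "ekonomi")]
print(knn_predict(train, [0.8, 0.2], k=1))  # 'spor'
```

The full sort over the training set makes the cost of each query visible: there is no training phase, but every prediction scans all stored instances, which is exactly the high-dimensional expense noted above.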
3.4.4. J48 Decision Tree
Decision tree learning is a supervised learning method that determines the category of an input document by building a decision tree over the available training set [33]. In the resulting decision tree, internal nodes represent attributes of the dataset, branches represent attribute values and leaves represent classification labels. The J48 classifier is a Java implementation of the C4.5 algorithm, which uses a divide-and-conquer approach for growing the decision tree. J48 is quite successful in the area of TC in particular and has advantages such as high performance on large datasets and short training times. It builds models that can be easily interpreted and can work with both categorical and continuous values. The main disadvantage of J48 is that a small variation in the training data may lead to a different decision tree.
3.4.5. Random Forest
Random Forest (RF) is an ensemble learning method based on decision trees, proposed by Leo Breiman and Adele Cutler, which grows many classification trees. First, random subspaces of features are selected to construct the branches of the decision trees [34]. Then, a training set is created for each individual tree. Finally, the RF classification model is created by combining all the individual trees. To categorize a document, its input features are passed to every tree in the forest; each tree returns a classification label, and the label with the highest vote is selected as the predicted outcome.
RF is a highly accurate classifier that runs efficiently on large datasets and can handle thousands of input features without any deletion. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing. RF also contains an experimental method for detecting feature interactions. However, the RF classifier may not perform well on a dataset containing categorical variables with varying numbers of levels, because random forests are biased in favor of attributes with more levels.
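The two randomized ingredients described above, a per-tree training set and a majority vote over the trees' predictions, can be sketched as follows. The "tree predictions" are stand-in labels rather than the output of real decision trees, and bootstrap-style sampling with replacement is assumed here as one common way of creating per-tree training sets.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) training instances with replacement for one tree."""
    return [rng.choice(data) for _ in data]

def forest_predict(tree_predictions):
    """Return the label with the highest vote among the trees' outputs."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
print(bootstrap_sample(["d1", "d2", "d3"], rng))
print(forest_predict(["spor", "spor", "ekonomi"]))  # 'spor'
```

Because each tree sees a different resample (and a different random feature subspace in a full RF), the individual trees disagree, and the majority vote averages out their variance.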
3.5. Feature selection
In a text classification approach, if too many features exist in a dataset, over-fitting may result and the accuracy of the classifier will presumably decrease. Besides, as the number of features increases, performing classification can become infeasible because of the lack of computational resources. Consequently, it is important to remove redundant and irrelevant features from the dataset before evaluating machine learning algorithms [33]. Feature selection is an important step for reducing dimensionality and removing irrelevant features. Feature selection methods are categorized as filter-based and wrapper-based methods. Filter-based methods rely on specific characteristics of the training instances to select features, without applying any learning algorithm. Wrapper-based methods, on the other hand, attempt to find features better suited to a pre-defined learning algorithm or classifier. In a classification task with high dimensionality, filter-based methods are usually preferred because of their computational efficiency. Therefore, two well-known filter-based feature selection approaches are utilized in this research, and their details are presented in the remainder of this section.
3.5.1. Correlation-based feature selection (CFS)
The CFS is a filter-based feature selection method used for evaluating subsets of features on the basis of a simple idea: "Good feature subsets contain features that are highly correlated with the classification, but contrarily have low correlation with other features" [35]. Equation 5 calculates the merit of a feature subset S containing k features.

Merit_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}   (5)

where \overline{r_{cf}} is the average value of all feature-classification correlations and \overline{r_{ff}} is the average value of all feature-feature correlations. The CFS method is usually combined with a heuristic search strategy such as best-first search, greedy stepwise search or genetic search.
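Equation 5 is easy to evaluate directly. The sketch below plugs in hypothetical average correlations to show the trade-off the merit score encodes: a smaller, less redundant subset can outscore a larger but highly inter-correlated one.

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """Equation 5: merit of a k-feature subset, given the average
    feature-class correlation r_cf and the average feature-feature
    correlation r_ff."""
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical correlations: ten relevant but mutually redundant features
# score lower than three equally relevant, nearly independent ones.
print(cfs_merit(10, 0.5, 0.9))
print(cfs_merit(3, 0.5, 0.1))
```

The denominator grows with both the subset size and the inter-feature redundancy, which is exactly why the heuristic search can stop adding features even when each one is individually correlated with the class.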
3.5.2. Attribute ranking-based feature selection (ARFS)
The idea behind ARFS is to rank features separately according to their predictive capability for the category and to select the top-ranking ones. One of the most widely used ranking criteria in the area of machine learning is Information Gain (IG) [36]. An information gain score is calculated for each feature over the categories, and the top N ranking features are selected as the feature subset. The IG of a feature f over the classes is calculated by Equation 6 below.

IG(f) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(f)\sum_{i=1}^{m} P(c_i \mid f)\log P(c_i \mid f) + P(\bar{f})\sum_{i=1}^{m} P(c_i \mid \bar{f})\log P(c_i \mid \bar{f})   (6)

where P(c_i) is the proportion of documents in category c_i over the total number of documents, P(c_i \mid f) is the proportion of documents containing the feature f that are in category c_i, and P(f) is the proportion of documents containing the feature f over the total number of documents [37].
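Equation 6 is equivalent to the entropy of the class distribution minus the expected class entropy after observing whether the feature is present, which the sketch below computes (using base-2 logarithms; all probabilities are hypothetical).

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(p_c, p_f, p_c_given_f, p_c_given_not_f):
    """Equation 6 for a binary feature f: H(C) - P(f) H(C|f) - P(not f) H(C|not f).
    p_c, p_c_given_f and p_c_given_not_f are per-class probability lists."""
    return (entropy(p_c)
            - p_f * entropy(p_c_given_f)
            - (1 - p_f) * entropy(p_c_given_not_f))

# A perfectly predictive feature over two balanced classes gains 1 bit.
print(information_gain([0.5, 0.5], 0.5, [1.0, 0.0], [0.0, 1.0]))  # 1.0
```

A feature whose presence leaves the class distribution unchanged scores zero, which is why setting the Ranker threshold to zero (as described in Section 4.1) discards exactly the uninformative features.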
4. Experimental study
In this section, we present detailed information about the experimental procedure applied, the TTC-3600 dataset created, the performance evaluation criteria considered and the experimental results obtained.
4.1. Experimental method
The experiments are performed using the implementations of the NB, J48, RF, SVM, and K-NN classifiers in WEKA (Waikato Environment for Knowledge Analysis) version 3.6.12 [38]. In this study, the default parameters are used for each WEKA classifier and feature selection method, since these parameters give promising experimental results [2]. For NB with continuous variables, no kernel method is used for estimating the distribution. For SVM, a non-linear kernel of degree 3 with WEKA's default settings is utilized. The default RF parameters 100 and 1 are selected, where the first number is the number of trees and the second is the random number seed used for each tree. Furthermore, the default K-NN and J48 classifier parameters are employed in the research. For K-NN, the value of the parameter k is set to 1, distance weighting is not applied and the Euclidean distance is selected as the distance function.
Each classifier is tested with 10-fold cross-validation, which is a common strategy for estimating classifier performance. In this strategy, the dataset is split into 10 blocks: a single block is retained as validation data for testing the model, and the remaining 9 blocks are used as training data. The process is then repeated 10 times so that each block serves once as the test set.
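The 10-fold procedure can be sketched in a few lines of Python (illustrative only; the experiments themselves used WEKA's built-in cross-validation, and the 100-item dataset here is a stand-in):

```python
import random

def cross_validate(data, k=10, seed=42):
    """10-fold cross-validation sketch: shuffle indices, split them into k
    blocks, and let each block serve once as the test fold while the
    remaining k - 1 blocks form the training set."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal blocks
    for held_out in range(k):
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, folds[held_out]

data = list(range(100))  # stand-in for the 3,600 TTC-3600 documents
splits = list(cross_validate(data))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```

Reported scores are then averaged over the 10 train/test splits, so every document contributes exactly once to testing.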
In this study, two FS methods, correlation-based (CFS) and attribute ranking-based (ARFS), are used in order to evaluate the performance of feature selection applied to the TTC-3600 dataset. For CFS, the CfsSubsetEval evaluator of the WEKA data mining tool with the BestFirst search strategy is used to select the best feature subset. For ARFS, the InfoGainAttributeEval evaluator with the Ranker search method is utilized to rank features according to their information gain scores. Instead of empirically selecting the N highest-ranking features, all features with an information gain score greater than zero are selected by setting the threshold parameter of the Ranker search method to zero.
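Once IG scores are available, the selection step is simple. The sketch below mirrors (but does not reproduce) the behaviour of WEKA's InfoGainAttributeEval evaluator combined with the Ranker method at threshold 0; the example scores are hypothetical:

```python
def arfs_select(ig_scores, threshold=0.0):
    """Rank features by information-gain score in descending order and keep
    every feature scoring above the threshold -- ARFS with threshold 0
    keeps all features that carry any information about the category."""
    ranked = sorted(ig_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [feature for feature, score in ranked if score > threshold]

# hypothetical IG scores: stop-word-like terms score zero and are dropped
scores = {"gol": 0.83, "seçim": 0.61, "ve": 0.0, "bir": 0.0}
print(arfs_select(scores))  # ['gol', 'seçim']
```

With a zero threshold, the retained subset size is data-driven rather than fixed in advance, which is how the 1,684/942/1,241/1,551 feature counts reported later arise.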
4.2. TTC-3600 dataset
Since the datasets used in other studies are either not accessible or were created for different purposes, a new dataset called TTC-3600 is created. The most important feature of this dataset, which can be widely used in TC studies on Turkish news and articles, is that it is simple to use and well documented. The dataset consists of a total of 3,600 documents: 600 news texts in each of 6 categories (economy, culture-arts, health, politics, sports and technology), obtained from 6 well-known news portals and agencies (Hürriyet3, Posta4, İha5, HaberTürk6, Radikal7 and Zaman8).
Documents of the TTC-3600 dataset are collected between May and July 2015 via Rich Site Summary (RSS) feeds from the 6 corresponding categories of the respective portals. A special RSS feeder, which can collect XML-format RSS feeds from any portal, is developed in the C# programming language on the Visual Studio 2013 IDE to fetch the feeds. In the study, the <title> and <description> XML elements of the RSS feeds are taken into consideration for text categorization. Since these items contain data that is unnecessary for TC, removal-based pre-processing is conducted: all JavaScript code, HTML tags (<img>, <a>, <p>, <strong> etc.), operators, punctuation, non-printable characters and irrelevant data such as advertising are removed.
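The removal-based pre-processing described above can be approximated with a few regular expressions. This is an illustrative Python sketch, not the authors' C# RSS feeder; the sample RSS item is invented:

```python
import re

def clean_rss_item(html_text):
    """Removal-based pre-processing sketch for an RSS <title>/<description>:
    strip scripts, HTML tags, punctuation/operators, and collapse whitespace."""
    text = re.sub(r"<script.*?</script>", " ", html_text, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)        # <img>, <a>, <p>, <strong> ...
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation and operators
    return re.sub(r"\s+", " ", text).strip()    # non-printables, extra spaces

item = "<p>Galatasaray <strong>3-0</strong> kazandı!</p><script>ads();</script>"
print(clean_rss_item(item))  # Galatasaray 3 0 kazandı
```

Note that in Python 3 the `\w` class already covers Turkish letters such as ç, ğ, ı, ö, ş and ü, so accented words survive the punctuation filter intact.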
Accepted for Publication
By the Journal of Information Science: http://jis.sagepub.co.uk
Kılınç et al. 8
Three additional dataset versions are created from TTC-3600 by applying different stemming methods. In all versions, the removal-based pre-processing explained in detail in Section 3.2 is applied first. Then Turkish stop-words, which have no discriminatory power for TC (pronouns, prepositions, conjunctions etc.), are removed from all datasets except the original one. In this study, a semi-automatically constructed stop-word list [26] containing 147 words is utilized.
After pre-processing is completed, the text documents (containing stems) in all dataset versions are transformed into document-term matrices by the text2arff tool [39], a feature extraction software, using the tf × idf weighting scheme. Each matrix is then converted into the attribute-relation file format (ARFF), which is the input format required by WEKA.
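The transformation into a weighted document-term matrix and its ARFF serialization can be sketched as follows. This is a simplified stand-in for the text2arff tool (not its actual implementation), with invented toy documents:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a document-term matrix with tf x idf weights: term frequency
    in the document times log(N / document frequency of the term)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))       # document frequency
    idf = {t: math.log(n / df[t]) for t in vocab}
    return vocab, [[Counter(d)[t] * idf[t] for t in vocab] for d in docs]

def to_arff(vocab, matrix, labels, relation="TTC-3600"):
    """Serialize the matrix in WEKA's attribute-relation file format (ARFF)."""
    lines = ["@RELATION " + relation, ""]
    lines += ["@ATTRIBUTE " + t + " NUMERIC" for t in vocab]
    lines.append("@ATTRIBUTE class {" + ",".join(sorted(set(labels))) + "}")
    lines += ["", "@DATA"]
    for row, label in zip(matrix, labels):
        lines.append(",".join("%.4f" % v for v in row) + "," + label)
    return "\n".join(lines)

# toy stemmed documents standing in for pre-processed TTC-3600 texts
docs = [["gol", "maç"], ["seçim", "oy"], ["gol", "takım"]]
labels = ["spor", "siyaset", "spor"]
vocab, matrix = tfidf_matrix(docs)
arff = to_arff(vocab, matrix, labels)
print(arff.splitlines()[0])  # @RELATION TTC-3600
```

Each row of the `@DATA` section is one document: its weighted term vector followed by its category label.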
Table 1 gives information about the TTC-3600 dataset versions. In the F5-DS and F7-DS datasets, stemming is performed using the FPS approach, where the first 5 and 7 characters of each word are selected as the stem, respectively. In the Zemb-DS dataset, the Zemberek NLP toolkit is used as the stemmer. The Original-DS, F5-DS, F7-DS and Zemb-DS datasets contain 7,508, 3,209, 4,814 and 5,693 words (features), respectively.
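The FPS approach amounts to truncating each word to a fixed-length prefix; a one-line sketch follows (Zemberek's full morphological analysis, used for Zemb-DS, is not reproduced here). The example words are illustrative:

```python
def fps_stem(word, n):
    """Fixed prefix stemming (FPS): keep only the first n characters as the
    stem, as in the F5-DS (n = 5) and F7-DS (n = 7) dataset versions;
    words shorter than n are kept whole."""
    return word[:n]

# inflected forms of the same Turkish root collapse to one pseudo-stem
words = ["teknolojik", "teknolojide", "teknoloji"]
print([fps_stem(w, 5) for w in words])  # ['tekno', 'tekno', 'tekno']
```

Collapsing inflected forms this way is what shrinks the vocabulary from 7,508 features to 3,209 (n = 5) or 4,814 (n = 7).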
Table 1. TTC-3600 dataset versions

No | Dataset name | Stop-words filtering | Stemmer    | Number of documents | Number of features
1  | Original-DS  | No                   | No-Stemmer | 3,600               | 7,508
2  | F5-DS        | Yes                  | FPS-5      | 3,600               | 3,209
3  | F7-DS        | Yes                  | FPS-7      | 3,600               | 4,814
4  | Zemb-DS      | Yes                  | Zemberek   | 3,600               | 5,693
The dataset and its files are publicly available so that experimental evaluations on TTC-3600 can be reproduced9. Each version of the TTC-3600 dataset includes two types of files in addition to the original pre-processed text files: the first, with a ".txt" extension, contains the names and ids of the features, whereas the second, in ARFF format, describes the list of instances sharing that set of features.
4.3. Evaluation criteria
In the machine learning domain, different evaluation criteria are used to evaluate classifiers. All of them are derived from a confusion matrix [40], which contains actual and predicted classification information. True positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) denote the four possible prediction outcomes. In this study, the most widely accepted evaluation criterion, ACC, is utilized. Each criterion is described in the following.
Accuracy (ACC) is the most widely used performance evaluation criterion; it is the ratio of correctly classified documents to the total number of documents. It is calculated using Equation 7 given below.
ACC = (TP + TN) / (TP + TN + FP + FN)    (7)

Precision is the proportion of documents assigned to a category that actually belong to it, whereas Recall is the proportion of documents belonging to a category that are correctly assigned to it. Precision and Recall are calculated using Equations 8 and 9, respectively.

Precision = TP / (TP + FP)    (8)

Recall = TP / (TP + FN)    (9)
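Equations 7-9 translate directly into code. In the sketch below, the confusion-matrix counts are hypothetical, chosen only to demonstrate the computation:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, Precision and Recall (equations 7-9) computed from the
    four confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return acc, precision, recall

# hypothetical counts for one category of a 3,600-document collection
acc, precision, recall = metrics(tp=540, tn=2940, fp=60, fn=60)
print("ACC=%.4f Precision=%.4f Recall=%.4f" % (acc, precision, recall))
```

Note that ACC counts both positive and negative decisions, whereas Precision and Recall focus on the positive class only; for multi-class TC these are typically computed per category and averaged.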
4.4. Experimental results and discussion
Figure 1 presents the ACC results of all classifiers on the TTC-3600 dataset versions. The aim of these experiments is to evaluate the performance of TC classifiers on dataset versions created using different stemming methods. Considering the experimental results, RF is the most accurate classifier in terms of ACC; it achieves the highest ACC value on all datasets regardless of stemming.
ACC values obtained by this classifier are 88.6%, 87.9%, 88.3% and 90.1% for the Original-DS, F5-DS, F7-DS and Zemb-DS datasets, respectively.
Figure 1. Experimental ACC percentage results of classifiers on datasets.
On the other hand, K-NN has the lowest ACC values among all classifiers, with results below 60%. The results closest to RF are achieved by SVM (except on Zemb-DS), which is a kernel-based classifier. The NB classifier gives more accurate results than SVM on the Zemb-DS dataset.
According to the data presented in Figure 1, there is at most a 3% ACC difference between the results of the classifiers on the Original-DS dataset and their results on the three datasets created after stemming. Considering that the Original-DS, F5-DS, F7-DS and Zemb-DS datasets have 7,508, 3,209, 4,814 and 5,693 features, respectively, the number of features drops dramatically, yet the effect of this reduction on ACC is at most 3%. This indicates that pre-processing applied to Turkish texts before TC has only a limited effect on accuracy.
Evaluating the success of the stemming methods in general, the classifier results obtained on the F5-DS and F7-DS datasets, which are stemmed with the FPS approach, are worse than those obtained on the original dataset (except for K-NN). The classifier results on the Zemb-DS dataset, created using the Zemberek NLP toolkit, are better than those on the original dataset (except for SVM). As a result, across all TTC-3600 datasets, stemming performed with the Zemberek stemmer outperforms all other methods.
Table 2. The number of remaining features after feature selection methods.

#           | Without FS | CFS | ARFS
Original-DS | 7,508      | 55  | 1,684
F5-DS       | 3,209      | 35  | 942
F7-DS       | 4,818      | 63  | 1,241
Zemb-DS     | 5,693      | 52  | 1,551
After the CFS and ARFS methods are applied in order to observe the effect of FS, the remaining numbers of features for each dataset are presented in Table 2. As a result of the CFS method, which is combined with the heuristic best-first search strategy, about 85% to 90% of the features in the datasets are eliminated because they are found to be irrelevant.
On the other hand, the number of features remaining after the ARFS method is considerably greater than the number obtained by CFS. For example, the F7-DS dataset initially contains 4,818 features, whereas 63 features remain after CFS and 1,241 after ARFS.
Table 3. The effect of FS methods on the experimental results.

     | ACC without FS            | ACC of CFS                | ACC of ARFS
     | OrgDS ZembDS F5DS  F7DS   | OrgDS ZembDS F5DS  F7DS   | OrgDS ZembDS F5DS  F7DS
NB   | 82.94 87.17  82.22 84.03  | 78.97 80.44  75.25 78.56  | 82.94 87.19  82.22 84.06
J48  | 78.06 79.00  77.14 75.50  | 76.72 78.19  71.67 74.78  | 78.97 79.39  77.36 75.97
RF   | 88.53 90.10  87.92 88.25  | 80.17 81.42  75.44 78.67  | 88.87 91.03  88.28 88.59
SVM  | 86.03 84.97  82.39 83.56  | 69.31 69.61  68.17 69.19  | 79.53 76.86  74.97 76.92
KNN  | 52.83 54.00  55.11 52.67  | 73.11 74.97  69.44 72.56  | 64.44 65.25  64.33 62.56
Table 3 shows the performance comparison of the feature selection methods in terms of ACC on the four datasets. As can be seen from Table 3, the ACC performance of all classifiers except K-NN on all datasets is reduced after applying the CFS method. For example, the ACC values of the NB classifier on OrgDS, ZembDS, F5DS and F7DS were 82.94, 87.17, 82.22 and 84.03, respectively, before the process, but drop to 78.97, 80.44, 75.25 and 78.56 after CFS. A similar decrease is observed for the RF classifier before and after CFS. For J48, since its ACC values are already low, only a minimal decrease occurs after CFS.
One of the largest performance decreases is observed for the non-linear kernel-based SVM classifier: when SVM is run after applying CFS, its ACC values drop by around 12%-15%. The SVM classifier, one of the state-of-the-art algorithms of today, addresses the problem of over-fitting via structural risk minimization, which provides regularization. Any intervention such as discretization or feature selection can invalidate its performance bounds and potentially undermine the structural risk minimization principle. Reconsidering the results given in Figure 1, SVM is the only classifier whose ACC value does not increase on the Zemb-DS dataset. In short, SVM has a fairly robust algorithmic design against uninformative features and produces better results when no selection or reduction is performed.
After performing CFS, the only classifier with significantly increased performance is K-NN. For example, its ACC value on the OrgDS dataset increases by about 21%, from 52.83% to 73.11%. The main reason is that the K-NN algorithm is directly affected by the phenomenon known in the literature as the "curse of dimensionality" [41] in high-dimensional settings. More specifically, in high dimensions, Euclidean distance becomes ineffective since all vectors are almost equidistant from the query vector10. Since the number of features decreases significantly after CFS, the K-NN algorithm performs much more accurately than it does without feature selection.
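This distance-concentration effect can be demonstrated numerically. The following toy Monte Carlo sketch (dimensions, point count and seed are arbitrary choices, unrelated to TTC-3600) compares the relative spread of Euclidean distances from a random query point in low- and high-dimensional unit cubes:

```python
import math
import random

def distance_spread(dim, n_points=200, seed=7):
    """Relative spread (max - min) / min of Euclidean distances from a random
    query to n_points random points in the dim-dimensional unit cube. As dim
    grows, the spread shrinks: all points become nearly equidistant, which is
    the effect that degrades K-NN before feature selection."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.sqrt(sum((q - rng.random()) ** 2 for q in query))
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# low-dimensional distances vary widely; high-dimensional ones concentrate
print(distance_spread(2) > distance_spread(1000))  # True
```

With only a few dozen features retained after CFS, nearest and farthest neighbours are well separated again, which is consistent with the observed K-NN improvement.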
Consequently, since around 85-90% of the features in the TTC-3600 datasets are eliminated by CFS, including some features that are discriminative for the categories, the ACC values of all classifiers except K-NN decrease. In addition, it is observed that applying feature selection before the SVM classifier, a non-linear kernel-based classifier that works well in high-dimensional environments, reduces its accuracy.
In addition to the CFS results, considering the ARFS results given in Table 3, the NB, J48 and RF classifiers obtain ACC results after ARFS that are similar to or better than those obtained on the original datasets. For example, the best ACC value in this study (91.03%) is obtained by the RF classifier on the ZembDS dataset after applying ARFS. Moreover, the ACC values obtained by these 3 classifiers after ARFS are higher than those obtained with CFS.
After performing ARFS, the SVM classifier performs 7-8% worse in terms of ACC than on the original datasets; on the other hand, it gives much more accurate results and higher ACC values than with CFS. This result on the TTC-3600 dataset is not surprising given the high performance of SVM, a non-linear kernel-based classifier, in high-dimensional environments: the number of features remaining after ARFS is much larger than after CFS, and SVM performs better in high-dimensional settings.
The ACC values of the K-NN classifier after ARFS are about 8-12% higher than its values on the original datasets; however, they are lower than its values with CFS. It can be concluded from the feature selection experiments that the performance of ARFS is superior to CFS for all classifiers except K-NN, which achieves notable success with a 74.97% ACC value on the ZembDS dataset using only the 52 features remaining after CFS. Accordingly, it can be speculated that even though the ACC results of the K-NN classifier degrade in high-dimensional environments, it can be promising when run with a small number of features, in other words, when dimensionality reduction methods are applied.
Finally, the RF classifier is the most accurate across all stemming steps (F5, F7, Zemberek) and feature selection methods (CFS, ARFS) applied to the TTC-3600 dataset, and the best ACC result is obtained on the ZembDS dataset after applying ARFS.
5. Conclusion and future works
In this study, extensive experiments on Turkish TC, for which work is very limited compared to other languages, are carried out, and all accessible studies in the literature are discussed. A new dataset called TTC-3600, which can be widely used in TC studies on Turkish news and articles, is created by collecting news from six well-known news portals and agencies in Turkey, and it has been made publicly available for use in comparative experiments by other researchers. Three pre-processed versions of the TTC-3600 dataset (with stemming, stop-word elimination etc.) are also created in addition to the original dataset and used in the experiments of this study. Detailed information about the TTC-3600 dataset is presented in Section 4.2.
Five well-known classifiers within the field of TC, NB, SVM, K-NN, J48 and RF, are evaluated on the TTC-3600 dataset. In addition, the CFS and ARFS feature selection methods are utilized in order to observe the impact of feature selection on Turkish TC. The experimental results indicate that, in all comparisons performed after the pre-processing and feature selection steps, the RF classifier gives the most accurate results, and the best ACC value of 91.03% is obtained on the Zemb-DS dataset version after applying ARFS.
In future studies, other TC classifiers, ensemble learning methods, different feature selection approaches and n-gram based dimensionality reduction methods can be used to investigate the TTC-3600 dataset in more detail. Another direction for future work is constructing a new, larger dataset by collecting many more documents and investigating horizontally scalable TC using a framework such as Hadoop MapReduce [42].
Notes
1. https://en.wikipedia.org/wiki/Unstructured_data.
2. http://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf.
3. http://dosyalar.hurriyet.com.tr/rss.
4. http://www.posta.com.tr/rss.
5. http://www.iha.com.tr/rss.html.
6. http://www.haberturk.com/rss.
7. http://www.radikal.com.tr/rss.
8. http://www.zaman.com.tr/rss_rssMainPage.action?sectionId=341.
9. https://github.com/GitCBU/TTC-3600.
10. https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
References
[1] Chen SY and Liu X. The contribution of data mining to information science. Journal of Information Science 2004; 30(6): 550-558.
[2] Amancio DR, et al. A systematic comparison of supervised classifiers. PLoS ONE 2014; 9(4): e94137.
[3] Michie D, Spiegelhalter DJ and Taylor CC. Machine learning, neural and statistical classification. USA: Ellis Horwood
Limited, 1994.
[4] Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys 2002; 34(1): 1-47.
[5] Jieming Y, Zhaoyang Q and Liu Z. Improved feature-selection method considering the imbalance problem in text categorization. The Scientific World Journal 2014.
[6] Onan A. Classifier and feature set ensembles for web page classification. Journal of Information Science 2015,
10.1177/0165551515591724.
[7] Wolpert DH and Macready WG. No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute, 1995.
[8] Read J. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In: Proceedings of the ACL student research workshop, 2005, pp. 43-48.
[9] Zhang P and He Z. Using data-driven feature enrichment of text representation and ensemble technique for sentence-level
polarity classification. Journal of Information Science 2015, 10.1177/0165551515585264.
[10] Cavnar WB and Trenkle JM. N-gram based text categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994, pp. 161-175.
[11] Ismail H, et al. Automatic Arabic text categorization: A comprehensive comparative study. Journal of Information Science
2015; 41(1): 114-124.
[12] Shaalan K and Oudah M. A hybrid approach to Arabic named entity recognition. Journal of Information Science 2014; 40(1):
67-87.
[13] Al-Radaideh QA, AlEroud AF and Al-Shawakfa EM. A hybrid approach to detecting alerts in Arabic e-mail messages. Journal of Information Science 2012; 38(1): 87-99.
[14] Güran A, Akyokuş S, Güler N and Gürbüz Z. Turkish text categorization using n-gram words. In: Proceedings of the International Symposium on Innovations in Intelligent Systems and Applications (INISTA), 2009, pp. 369-373.
[15] Torunoğlu D, Çakırman E, Ganiz MC et al. Analysis of preprocessing methods on classification of Turkish texts. In:
Proceedings of International Symposium on Innovations in Intelligent Systems and Applications, 2011, pp. 112-118.
[16] Akkus BK and Cakici R. Categorization of Turkish news documents with morphological analysis. In: Proceedings of the ACL student research workshop, 2013, pp. 1-8.
[17] Amasyalı MF and Beken A. Measurement of Turkish word semantic similarity and text categorization application. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference, Antalya, Turkey, 9-11 April 2009. New York: IEEE, pp. 1-4.
[18] Amasyali MF, Diri B. Automatic Turkish text categorization in terms of author, genre and gender. In: Natural Language
Processing and Information Systems, Springer Berlin Heidelberg, 2006, pp. 221-226.
[19] Tufekci P and Uzun E. Author detection by using different term weighting schemes. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference (SIU), Trabzon, Turkey, 24-26 April 2013. New York: IEEE, pp. 1-4.
[20] Çataltepe Z, Turan Y and Kesgin F. Turkish document classification using shorter roots. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference (SIU), Eskisehir, Turkey, 11-13 June 2007. New York: IEEE, pp. 1-4.
[21] Alparslan E, Karahoca A and Bahşi H. Classification of confidential documents by using adaptive neurofuzzy inference
systems. Procedia Computer Science 2011; 3: 1412-1417.
[22] Uysal AK and Gunal S. The impact of preprocessing on text classification. Information Processing and Management 2014; 50:
104-112.
[23] Gunal S. Hybrid feature selection for text classification. Turkish Journal of Electrical Engineering and Computer Sciences 2012; 20: 1296-1311.
[24] Özalp N, Yılmaz G and Ayan U. Novel comment filtering approach based on outlier on streaming data. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference (SIU), Mugla, Turkey, 18-20 April 2012. New York: IEEE, pp. 1-4.
[25] Özgür L, Güngör T and Gürgen F. Adaptive anti-spam filtering for agglutinative languages. Pattern Recognition Letters 2004; 25(16): 1819-1831.
[26] Can F, Kocberber S, Balcik E, Kaynak C, Ocalan HC and Vursavas OM. Information retrieval on Turkish texts. Journal of the American Society for Information Science and Technology 2008; 59(3): 407-421.
[27] Kılıçaslan Y, Güner ES and Yıldırım S. Learning-based pronoun resolution for Turkish with a comparative evaluation.
Computer Speech and Language 2009; 23: 311-331.
[28] Salton G and Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing & Management 1988; 24(5): 513-523.
[29] Akin AA and Akin MD. Zemberek, an open source NLP framework for Turkic Languages, 2007.
[30] Yildirim P and Birant D. Naive Bayes classifier for continuous variables using novel method (NBC4D) and distributions. In: Proceedings of the IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA), Alberobello, Italy, 23-25 June 2014. New York: IEEE, pp. 110-115.
[31] Sebastiani F. Text categorization. Text Mining and Its Applications 2005, pp. 109-129.
[32] Aha DW, Kibler D and Albert MK. Instance-based Learning Algorithms. Machine Learning 1991; 6(1): 37-66.
[33] Quinlan JR. C4.5: Programs for Machine Learning. Machine Learning 1993; 16(3): 235-240.
[34] Xu B, Guo X, Ye Y and Cheng J. An Improved Random Forest Classifier for Text Categorization. Journal of Computers 2012;
7(12): 2913-2920.
[35] Hall M. Correlation-based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, New Zealand, 1999.
[36] Yang Y and Pedersen JO. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning (ICML '97), Nashville, TN, USA, 1997. Morgan Kaufmann, pp. 412-420.
[37] Youn E and Jeong MK. Class dependent feature scaling method using naive Bayes classifier for text datamining. Pattern Recognition Letters 2009; 30(5): 477-485.
[38] Witten IH and Frank E. Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, CA: Morgan Kaufmann, 2005.
[39] Amasyali M, et al. Text2arff: automatic feature extraction software for Turkish texts. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference (SIU), Diyarbakir, Turkey, 22-24 April 2010. New York: IEEE, pp. 629-632.
[40] Kohavi R and Provost F. On applied research in machine learning. Machine Learning 1998; 30(2-3): 127-132.
[41] Beyer K, Goldstein J, Ramakrishnan R and Shaft U. When is "nearest neighbor" meaningful? In: Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel, 10-12 January 1999, pp. 217-235.
[42] Meijing L, Xiuming Y and Ryu KH. MapReduce-based web mining for prediction of web-user navigation. Journal of
Information Science 2014; 40(5): 557-567.
... During the experimental phase of the study, we evaluate several models combining two paragraph vector architectures with ad-hoc retrieval. The experiments performed on a well-known Turkish news collection [9] show that the proposed approach can reach up to 93.5% classification accuracy, which results in more accurate predictions than the baseline benchmark methods provided by Kılınç et al. [9]. Furthermore, our results are highly close and comparable to the current state-of-the-art [6]. ...
... During the experimental phase of the study, we evaluate several models combining two paragraph vector architectures with ad-hoc retrieval. The experiments performed on a well-known Turkish news collection [9] show that the proposed approach can reach up to 93.5% classification accuracy, which results in more accurate predictions than the baseline benchmark methods provided by Kılınç et al. [9]. Furthermore, our results are highly close and comparable to the current state-of-the-art [6]. ...
... In this study, we use the public TTC-3600 dataset [9] to develop and evaluate models for Turkish news categorization. As its name implies, the collection consists of 3600 Turkish news articles and their corresponding categories. ...
Article
News categorization, which is a common application area of text classification, is the task of automatic annotation of news articles with predefined categories. In parallel with the rise of deep learning techniques in the field of machine learning, neural embedding models have been widely utilized to capture hidden relationships and similarities among textual representations of news articles. In this study, we approach the Turkish news categorization problem as an ad-hoc retrieval task and investigate the effectiveness of paragraph vector models to compute and utilize document-wise similarities of Turkish news articles. We propose an ensemble categorization approach that consists of three main stages, namely, document processing, paragraph vector learning, and document similarity estimation. Extensive experiments conducted on the TTC-3600 dataset reveal that the proposed system can reach up to 93.5% classification accuracy, which is a remarkable performance when compared to the baseline and state-of-the-art methods. Moreover, it is also shown that the Distributed Bag of Words version of Paragraph Vectors performs better than the Distributed Memory Model of Paragraph Vectors in terms of both accuracy and computational performance.
... Information about the studies examined in the literature is shown in Table 1. [20]. ...
... Since text data obtained from news websites is more formal and likely to yield more accurate results in finding entity relationships, it was selected as the training data for the proposed model. The Turkish TTC-4900 dataset [20], which is commonly used in the literature for text classification tasks, was used for model training. Additionally, the widely used 20Newsgroups dataset [38] and the BBC dataset [39] in the literature for English text classification tasks were used for model performance comparison. ...
... TVQ [32,68] evaluates t j using score function given by Formula (53). ...
... The CNAE dataset [29,30] is made up of 1080 business description documents of nine types of Brazilian Companies. The KDC dataset [77,88,89] and the TTC dataset [53] respectively contain 4007 and 3600 documents of Turkish news and articles. ...
Article
Full-text available
Filter feature selection methods are utilized to select discriminative terms from high-dimensional text data to improve text classification performance and reduce computational costs. This paper aims to provide a comprehensive systematic review of existing filter feature selection methods for text classification. Firstly, we briefly discuss text classification based on filter feature selection. Secondly, we present a detailed discussion on mathematical designs, effectiveness and complexity of existing filter feature selection methods of different methodologies (supervised methods, unsupervised methods and hybrid methods). In addition, a certain number of benchmark datasets for evaluating performance of filter feature selection methods in text classification are also discussion. Finally, we provide future directions in filter feature selection, along with conclusion.
... One investigation into the categorization of Turkish news data, Kılınç et al. [10] created a new dataset called TTC-3600 that may be extensively used in TC research of Turkish news and article content. On TTC-3600, different successful classifiers in the TC domain and successful feature selection methods are evaluated. ...
... The preprocessing tasks for the multiclass classification problem are evaluated using the news datasets. The first dataset is TTC-3600 [10]. Being userfriendly and well-documented is the most crucial aspect of this dataset, which may be extensively employed in TC studies pertaining to Turkish news and articles. ...
Article
In a standard text classification (TC) study, preprocessing is one of the key components to improve performance. This study aims to look at how preprocessing effects TC according to news text, text language, and feature selection. All potential combinations of commonly used preprocessing techniques are compared on one domain, namely news data, and in two different news datasets for this aim. Preprocessing technique contributions to classification performance at multiple feature sizes, possible interconnections among these techniques, and technique dependency on corresponding languages are all evaluated in this way. Using best combinations of preprocessing techniques rather than using or not using them all, experimental studies on public datasets reveals that, choosing best combinations of preprocessing techniques can improve classification accuracy significantly.
... We also test our model on other languages newspaper datasets. The BERT model which produces best results in terms of our dataset, we applied BERT model on the TTC-3600 dataset of the Turkish language newspaper provided by Kilinc et al. (2017) and the accuracy was 92.85%. Aian, we applied our best BERT model on other language newspaper provided by Dogru et al. (2021). ...
Article
The rapid increase in obtainable online text data has made text categorization an important tool for data analysts to extract relevant information on the web. However, incorrect or incomplete classification of marginalized groups may result from using biased text data. In order to remedy the disparity in available data, this research suggests a system for classifying and analyzing Bangla news articles. The suggested approach first uses both Random Under-Sampling (RUS) and Synthetic Minority Oversampling Techniques to balance the massive unbalanced Bangla News dataset consisting of 4,37,948 instances (SMOTE). Secondly, the proposed system employs three machine learning models: Logistic Regression, Decision Tree, and Stochastic Gradient Descent along with three deep learning models: Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Bidirectional Encoder Representations from Transformers (BERT) for Bangla text categorization. The experimental results signify the superior performance of BERT to other classification models of the system as well as other existing methods in this domain. The proposed system achieves the maximum accuracy of 99.04% in balanced dataset and 72.23% in imbalanced dataset using BERT. K-fold cross validation with varied K values is used to determine the performance consistency of BERT. Finally, both LIME (Local Interpretable Model agnostic Explanations and SHAP (SHapley Additive exPlanations) techniques are applied for interpreting each prediction made by BERT.
Article
Full-text available
The selection of discriminative terms from the large set of terms in text documents helps achieve better text classification accuracy. To address the task of selecting discriminative terms from text, a deep learning based feature selection method is proposed. The method is built on the long short-term memory (LSTM) network: a deep LSTM-based network is trained in an unsupervised manner to extract deep features from bag-of-words term frequency vectors. These deep features are integrated with term frequencies to evaluate the effectiveness of terms, extending feature selection beyond the limits of term frequency information alone. Experiments on nine public datasets demonstrate that our method selects discriminative terms better than comparative methods.
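The bag-of-words term-frequency vectors the method starts from, and the idea of ranking terms by an effectiveness score to keep the top k, can be illustrated as follows. The raw-frequency score here is a deliberately simple stand-in for the paper's LSTM-derived deep-feature score, and the corpus is hypothetical.

```python
from collections import Counter

def term_frequency_vectors(corpus):
    """Build the bag-of-words term-frequency vectors the method takes as input."""
    vocab = sorted({t for doc in corpus for t in doc.split()})
    vectors = []
    for doc in corpus:
        counts = Counter(doc.split())
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors

def select_terms(vocab, vectors, k):
    """Rank terms by a score and keep the top k; total corpus frequency is
    used here as a simple stand-in for the learned effectiveness score."""
    totals = [sum(v[i] for v in vectors) for i in range(len(vocab))]
    ranked = sorted(zip(vocab, totals), key=lambda p: -p[1])
    return [t for t, _ in ranked[:k]]

corpus = ["match goal goal", "goal score", "vote vote election"]
vocab, vecs = term_frequency_vectors(corpus)
print(select_terms(vocab, vecs, 2))  # ['goal', 'vote']
```

In the actual method, the stand-in score would be replaced by a combination of term frequency and the deep features extracted by the unsupervised LSTM network.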
Article
Full-text available
Since Turkish is an agglutinative language and contains reduplication, idiom, and metaphor words, Turkish texts are sources of information with extremely rich meanings. For this reason, the processing and classification of Turkish texts according to their characteristics is both time-consuming and difficult. In this study, the performances of pre-trained language models for multi-text classification using Autotrain were compared in a 250 K Turkish dataset that we created. The results showed that the BERTurk (uncased, 128 k) language model on the dataset showed higher accuracy performance with a training time of 66 min compared to the other models and the CO2 emission was quite low. The ConvBERTurk mC4 (uncased) model is also the best-performing second language model. As a result of this study, we have provided a deeper understanding of the capabilities of pre-trained language models for Turkish on machine learning.
Article
Full-text available
Web page classification is an important research direction in web mining. The abundant amount of data available on the web makes it essential to develop efficient and robust models for web mining tasks. Web page classification is the process of assigning a web page to a particular predefined category based on labelled data. It serves several other web mining tasks, such as focused web crawling, web link analysis and contextual advertising. Machine learning and data mining methods have been successfully applied to several web mining tasks, including web page classification. Multiple classifier systems are a promising research direction in machine learning, which aims to combine several classifiers by differentiating base classifiers and/or dataset distributions so that more robust classification models can be built. This paper presents a comparative analysis of four different feature selection methods (correlation-, consistency-, information gain- and chi-square-based feature selection) and four different ensemble learning methods (Boosting, Bagging, Dagging and Random Subspace) based on four different base learners (naive Bayes, the K-nearest neighbour algorithm, the C4.5 algorithm and the FURIA algorithm). The article examines the predictive performance of ensemble methods for web page classification. The experimental results indicate that feature selection and ensemble learning can enhance the predictive performance of classifiers in web page classification. For the DMOZ-50 dataset, the highest average predictive performance (88.1%) is obtained with the combination of consistency-based feature selection with the AdaBoost and naive Bayes algorithms, which is a promising result for web page classification. The results also indicate that the Bagging and Random Subspace ensemble methods and the correlation-based and consistency-based feature selection methods obtain better accuracy rates.
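One of the four compared criteria, chi-square-based feature selection, scores each term by the chi-square statistic of its 2x2 term/class contingency table; terms whose occurrence is independent of the class score 0 and are dropped first. A minimal sketch, with hypothetical document counts:

```python
def chi_square(n_11, n_10, n_01, n_00):
    """Chi-square statistic for a term/class 2x2 contingency table:
    n_11 = docs in the class containing the term, n_10 = docs in the class without it,
    n_01 = other docs containing the term,        n_00 = other docs without it."""
    n = n_11 + n_10 + n_01 + n_00
    num = n * (n_11 * n_00 - n_10 * n_01) ** 2
    den = (n_11 + n_01) * (n_10 + n_00) * (n_11 + n_10) * (n_01 + n_00)
    return num / den if den else 0.0

# a term in 8 of 10 class docs but only 1 of 10 other docs: strongly class-bound
print(round(chi_square(8, 2, 1, 9), 2))   # 9.9
# a term spread evenly across classes carries no class information
print(chi_square(5, 5, 5, 5))             # 0.0
```

Feature selection then keeps the k highest-scoring terms per class (or by maximum score over classes) before the base learners are trained.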
Conference Paper
Full-text available
Preprocessing is an important task and a critical step in information retrieval and text mining. The objective of this study is to analyze the effect of preprocessing methods on the classification of Turkish texts. We compiled two large datasets from Turkish newspapers using a crawler. On these compiled datasets, together with two additional datasets, we perform a detailed analysis of preprocessing methods such as stemming, stopword filtering and word weighting for Turkish text classification, and we report the results of extensive experiments.
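The word weighting step studied here is commonly realized as TF-IDF: a term's weight in a document is its frequency there, discounted by how many documents contain it. A minimal sketch with a hypothetical Turkish mini-corpus:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Weight each term by its in-document frequency times log inverse document frequency."""
    n = len(corpus)
    tokenized = [doc.split() for doc in corpus]
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

corpus = ["ekonomi piyasa piyasa", "spor maç", "ekonomi borsa"]
w = tf_idf(corpus)
# "piyasa" is frequent in doc 0 and rare elsewhere, so it outweighs "ekonomi"
print(w[0]["piyasa"] > w[0]["ekonomi"])  # True
```

Stemming and stopword filtering would be applied to the tokens before weighting, which is exactly why the study evaluates the methods in combination rather than in isolation.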
Article
Full-text available
Text categorization or classification (TC) is concerned with placing text documents in their proper category according to their contents. Owing to the various applications of TC and the large volume of text documents uploaded to the Internet daily, the need for such an automated method stems from the difficulty and tedium of performing such a process manually. The usefulness of TC is manifested in different fields and needs. For instance, the ability to automatically classify an article or an email into its right class (Arts, Economics, Politics, Sports, etc.) would be appreciated by individual users as well as companies. This paper is concerned with the TC of Arabic articles. It contains a comparison of the five best-known algorithms for TC. It also studies the effects of utilizing different Arabic stemmers (light and root-based stemmers) on the effectiveness of these classifiers. Furthermore, a comparison between different data mining software tools (Weka and RapidMiner) is presented. The results illustrate the good accuracy provided by the SVM classifier, especially when used with the light10 stemmer. This outcome can be used in the future as a baseline for comparison with other unexplored classifiers and Arabic stemmers.
Article
As an important issue in sentiment analysis, sentence-level polarity classification plays a critical role in many opinion-mining applications such as opinion question answering, opinion retrieval and opinion summarization. Employing a supervised learning paradigm to train a classifier on sentences often faces the data sparseness problem owing to the short length of such texts. In this article, to address this problem, we exploit two different feature sets learned from external datasets as additional features to enrich the data representation: one is a latent topic feature set obtained using a topic model, and the other is a related-word feature set derived using word embeddings. Furthermore, we propose an ensemble approach that uses these additional features to guide the design of the different members of the ensemble. Experimental results on the public movie review dataset demonstrate that the enriched representations are effective for improving the performance of polarity classification, and that the proposed ensemble approach can further improve overall performance.
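The enrichment step amounts to concatenating the sentence's base features with the two externally learned feature sets, so a short sentence is represented by more than its few surface terms. A minimal sketch; all vectors below are hypothetical toy values, not features from the paper's models.

```python
def enrich(bow_vec, topic_vec, embedding_vec):
    """Concatenate base bag-of-words features with two externally learned
    feature sets to form the enriched sentence representation."""
    return list(bow_vec) + list(topic_vec) + list(embedding_vec)

sentence_bow = [1, 0, 2]           # sparse term counts for a short sentence
latent_topics = [0.7, 0.1]         # hypothetical topic-model posterior
related_words = [0.3, -0.2, 0.5]   # hypothetical embedding-derived features
print(enrich(sentence_bow, latent_topics, related_words))
# [1, 0, 2, 0.7, 0.1, 0.3, -0.2, 0.5]
```

The ensemble design then varies which of these feature sets each member classifier sees, so the members make usefully different errors.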
Article
This paper proposes an improved random forest algorithm for classifying text data. The algorithm is particularly designed for analyzing very high dimensional, multi-class data, of which text corpora are a well-known representative. A novel feature weighting method and a tree selection method are developed and combined to make the random forest framework well suited to categorizing text documents spanning dozens of topics. With the new feature weighting method for subspace sampling and the tree selection method, we can effectively reduce the subspace size and improve classification performance without increasing the error bound. We apply the proposed method to six text datasets with diverse characteristics. The results demonstrate that the improved random forest outperforms popular text classification methods in terms of classification performance.
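The core idea of weighted subspace sampling can be sketched as drawing a tree's feature subspace without replacement, with probability proportional to an informativeness weight, so that a small subspace still tends to contain useful terms. The feature names and weights below are hypothetical, and this is not the paper's exact weighting scheme.

```python
import random

def weighted_subspace(features, weights, size, seed=7):
    """Sample a feature subspace without replacement, biased toward
    informative features, for growing one tree of the forest."""
    rng = random.Random(seed)
    pool = list(zip(features, weights))
    chosen = []
    while pool and len(chosen) < size:
        total = sum(w for _, w in pool)
        r = rng.uniform(0, total)       # roulette-wheel draw over remaining weights
        acc = 0.0
        for i, (feat, w) in enumerate(pool):
            acc += w
            if acc >= r:
                chosen.append(feat)
                pool.pop(i)             # without replacement
                break
    return chosen

features = ["goal", "match", "the", "election", "vote"]
weights  = [0.9,    0.8,     0.01,  0.7,        0.6]   # hypothetical informativeness
print(weighted_subspace(features, weights, 3))
```

Plain random forests sample the subspace uniformly; weighting the draw is what lets the subspace size shrink without starving trees of discriminative terms.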
Article
Predicting web user behaviour is typically an application of finding frequent sequence patterns. With the rapid growth of the Internet, a large amount of information is stored in web logs, and traditional frequent-sequence-pattern-mining algorithms are hard pressed to analyse such big datasets. In this paper, we propose an efficient way to predict the navigation patterns of web users by improving frequent-sequence-pattern-mining algorithms based on the MapReduce programming model, which can handle huge datasets efficiently. Our experiments show that the proposed MapReduce-based algorithm is more efficient than traditional frequent-sequence-pattern-mining algorithms, and by comparing it with existing algorithms in web-usage mining, we also show that using the MapReduce programming model saves time.
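The MapReduce formulation can be illustrated for the simplest sequence length, consecutive page pairs: mappers emit a (pair, 1) record per click transition in a session, and reducers sum the counts per key. The session data is illustrative, and a real deployment would shard both phases across a cluster.

```python
from collections import defaultdict

def map_phase(session):
    """Emit (page-pair, 1) for each consecutive click in one user session."""
    return [((a, b), 1) for a, b in zip(session, session[1:])]

def reduce_phase(pairs):
    """Sum the counts for each key, as the reducers would in parallel."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

sessions = [["home", "news", "sports"], ["home", "news", "tech"]]
mapped = [kv for s in sessions for kv in map_phase(s)]
print(reduce_phase(mapped)[("home", "news")])  # 2
```

Longer frequent sequences are mined the same way by emitting candidate subsequences in the map phase and pruning those whose reduced count falls below the support threshold.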