Article
Corresponding author:
Deniz Kılınç, Department of Software Engineering, Faculty of Technology, Celal Bayar University, Manisa, Turkey.
Email address: drdenizkilinc@gmail.com
Journal of Information Science
1–13
© The Author(s) 2015
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0165551510000000
jis.sagepub.com
TTC-3600: A new benchmark dataset
for Turkish text categorization
Deniz Kılınç
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Akın Özçift
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Fatma Bozyigit
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Pelin Yıldırım
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Fatih Yücalar
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Emin Borandag
Department of Software Engineering, Faculty of Technology, Celal Bayar University, Turkey
Abstract
Due to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet increases explosively with each passing day. Considering news portals in particular, documents related to categories such as technology, sports and politics sometimes appear in the wrong category, or are placed in a generic category called others. At this point, text categorization (TC), which is generally addressed as a supervised learning task, is needed. Although there is a substantial number of studies on TC in other languages, the number of studies conducted on Turkish is very limited, owing to the lack of accessibility and usability of the datasets created. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC on Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy (ACC) value, 91.03%, is obtained with the combination of the Random Forest (RF) classifier and the attribute ranking-based feature selection method, among all comparisons performed after the pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.
Keywords
Text classification; Turkish text categorization; feature selection; TTC-3600 dataset
1. Introduction
The rapid growth of the World Wide Web and of Internet use leads to a rapid increase in the amount of unstructured
data on the Internet with each passing day. According to the International Data Corporation (IDC), the amount of
unstructured data on the Internet will reach 40 zettabytes by 2020, 50 times the amount of unstructured data that existed on the Internet in 2010¹. Manual categorization of this unstructured data is practically impossible, so a continuous automatic categorization process is needed to keep the data manageable and accessible. Considering news portals in particular, documents related to categories such as technology, sports, politics and health sometimes appear in the wrong category, or are placed in a generic category called others. At this point, approaches and methods from the field of Text Mining (TM) [1], which is an
important research area, are needed. The purpose of TM, which is also known as Intelligent Text Analysis, Knowledge
Discovery in Text and Text Data Mining in the literature, is extracting valuable and significant information and
knowledge from unstructured text documents [2]. TM is an interdisciplinary field that can use machine learning [3],
computational linguistics, information retrieval and statistics compositely. One of the most widely utilized methods in
TM studies is Text Categorization/Classification (TC), which falls within the supervised-learning category of machine learning. TC builds a model from a pre-defined set of labeled data and aims to assign uncategorized data to the correct category [4]. In other words, it evaluates uncategorized data based on its content and categorizes it.
One of the most important characteristics of TC is its high dimensionality, in which thousands of features can be generated [5]. Most of the features are irrelevant and degrade the performance of the classifier. Hence, dimensionality reduction, which removes redundant and irrelevant features from the dataset before machine learning algorithms are evaluated, is a critical step in TC. Feature selection is the most widely used dimensionality reduction technique; it selects a relevant subset of the entire feature set [6].
In this study, a new dataset called TTC-3600, which can be widely used in the studies of TC regarding Turkish news
and articles, is created and comprehensive experimental studies are performed on this dataset. Considering the literature,
although there are a substantial number of studies conducted on TC in other languages, the number of studies conducted
in Turkish is very limited. All TC studies available in Turkish in the literature are investigated within the scope of this paper. Since the datasets used in other studies are either not available or were created for different purposes, the dataset used in this study consists of news collected from six news portals and agencies that are very well known in Turkey, and it has been made publicly available so that other researchers can use it in their experimental work⁹.
Three different versions of TTC-3600, which are subjected to stemming, are also created and utilized in order to
observe the effect of pre-processing on Turkish TC. In the machine learning domain, various types of TC algorithms exist, such as lazy learning, statistical learning and decision tree induction. Among these, selecting the single best-performing one is a challenging task, as indicated by the No Free Lunch (NFL) theorem [7]. Based on this theorem, five well-known
classifiers Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), Decision Tree (J48) and
Random Forest (RF) in the field of TC are evaluated on all versions of TTC-3600 dataset.
In addition to these experimental studies, impacts of dimensionality reduction methods on Turkish TC are also
observed during experimental studies. Correlation-based feature selection (CFS) and attribute ranking-based (ARFS)
feature selection methods are employed in order to evaluate the results of dimensionality reduction technique. The
experimental results show that RF classifier is more accurate in all stemming steps (F5, F7 and Zemberek) and feature
selection methods applied on TTC-3600 dataset and the best ACC result is obtained after applying ARFS on Zemb-DS
dataset.
The rest of the paper is organized as follows: the second section offers a comprehensive literature review on TC. In the third section, the materials and methods utilized are introduced briefly. Section four presents the experimental study and discusses the experimental results obtained. Finally, the fifth section concludes the paper with some future directions.
2. Related Works
Considering the previous studies in the literature, although there are many studies on TC in other languages, the number of TC studies conducted in Turkish is very limited. For instance, there are many TC studies conducted in English, one of the most widely spoken languages in the world [8-10]. In addition, there are interesting studies in the literature performed in other languages, such as Arabic, which has different morphological properties. The study of Hmeidi et al. [11] aims to assign articles written in Arabic to the relevant categories. Five well-known algorithms in the field of TC are discussed and the success rates achieved by these algorithms are compared with each other. Another study on Arabic TC, proposed by Shaalan and Qudash [12], combines different machine learning algorithms in order to perform named entity recognition. It is claimed that the success rate of the study exceeds 90% and that it gives accurate results. Al-Radaideh et al. [13] conducted a study to detect spam emails composed in Arabic. They claim that they obtained accurate results for 87% of the messages in the dataset by using the Graham statistical filter and a rule-based filter.
The aim of this study is to investigate Turkish text categorization. In a study conducted by Güran et al. [14], the NB, Multinomial Naïve Bayes (MNB), J48 and K-NN TC algorithms are evaluated. Their study is based on the N-gram algorithm, and their experiments are carried out on documents both with and without pre-processing. According to the experimental results, the worst results are obtained when bi-gram and tri-gram representations are used together with the K-NN algorithm, while the J48 classifier gives the best classification results in general.
In another study, conducted by Torunoglu et al. [15], the importance of pre-processing steps in Turkish TC is examined. Different pre-processing methods and four TC algorithms (NB, MNB, SVM and K-NN) are evaluated. Considering the experimental results, it is concluded that pre-processing did not have the expected impact on Turkish TC.
Akkuş and Cakıcı [16] suggested that morphological analysis would be a useful method for TC in semantically rich languages such as Turkish, and studied its contribution to Turkish TC. First, the stems of the words are identified using the Fixed Length Stemmer method, and then the K-NN, SVM and NB learning algorithms are evaluated on these stems. According to the evaluation results on their dataset, a simple approximation that represents documents with the first five characters of each word gives similar or better results than an expensive morphological analysis, at much lower cost.
In the study of Amasyalı and Beken [17], a different approach to TC is presented. They map the words of a text document into a semantic space that they have created, and indicate that representing words in this semantic space gives better results than the bag-of-words model. According to their experimental results, the Linear Regression classification algorithm gives the most successful results.
Amasyalı and Diri [18] proposed an n-gram approach to Turkish TC. They evaluated the NB, SVM, J48 and Random Forest classification algorithms, and suggested that classifiers trained with bi-grams give better results than those trained with tri-grams. Considering the results of the classification algorithms, NB gives more successful results in determining the author of a text, whereas SVM gives more accurate results in determining the genre of the text and the gender of the author.
Tüfekçi and Uzun [19] investigated the effect of different term weighting methods on identifying the author of a text. After the stems of the words are identified, different feature vectors are determined for each document by trying different weighting methods. The MNB, SVM, Decision Tree and Random Forest classification algorithms are applied to the vectors created and the results are compared with each other. According to the experimental results, the best results are obtained with the SVM algorithm.
In the study of Çataltepe et al. [20], the effect of the length of stems derived from words on Turkish TC is studied. They obtain short stems from long stems using various methods, regardless of the meaning of the words. Their aim is to compare the accuracy rates obtained by classifying tf-idf-weighted vectors built from stems containing fewer characters. As a result, it is observed that the Centroid classification method conducted with shortened stems gives better results.
In a study conducted by Alparslan et al. [21], the aim is to perform information extraction from classified Turkish-language documents. First, word stems are extracted using stemming algorithms designed specifically for Turkish text documents. Document-term matrices are then formed by applying the tf-idf weighting method to the stems obtained after pre-processing. Unlike other studies, the SVM and Adaptive Neuro-Fuzzy classification algorithms are combined in this study, and the experimental results suggest that the proposed method is more accurate.
In the study of Uysal and Gunal [22], it is shown that pre-processing is important for TC. Emails and news written in both English and Turkish are used as the dataset, and the ways in which pre-processing methods affect the classification of text documents are determined. They examined how tokenization, stop-word removal, lowercase conversion and stemming, and their various combinations, affect the accuracy rate of the SVM classification algorithm. As a result, it is seen that some pre-processing methods reduce the classification accuracy of text documents, while lowercase conversion and stop-word removal improve it.
Gunal [23] studied the effect of different feature selection approaches on TC. A hybrid selection method is proposed by combining filter and wrapper feature selection methods. According to these studies, the features obtained by this method give better results in Turkish TC than those obtained by a single selection method.
There are also other Turkish text analysis and text retrieval studies besides those on Turkish TC. Özalp et al. [24] conducted studies to detect slang words in news and in comments made on articles and columns on the Internet. They proposed a system that can automatically filter comments made on online articles, magazines and news texts. Unlike the most widely used classification approaches in the literature, they proposed an irregularity-based approach, which is suggested to be advantageous in terms of memory management and low computational complexity.
In the study of Özgür et al. [25], an anti-spam filtering method developed for Turkish in particular, and for agglutinative languages in general, is proposed. The study consists of two separate modules: a Learning Module and a Morphology Module. Both Artificial Neural Network and Bayesian Network algorithms are used, and the authors claim a success rate of 90% in finding Turkish spam emails on their dataset.
Can et al. [26] formulated hypotheses about the factors that can affect the performance of text retrieval and tested the validity of these hypotheses one by one. First, they considered the hypothesis that creating a stop-word list and removing these stop-words would affect retrieval performance; however, according to the tests conducted, this process does not have a significant impact on text retrieval.
Kılıçaslan et al. [27] studied anaphora resolution in Turkish texts. They compared different methods for identifying pronouns in Turkish texts and evaluated the success of different machine learning algorithms used to analyze Turkish text documents. Considering the success rates of anaphora resolution, the learning models are suggested to be more successful than the baselines.
3. Materials and methods applied
3.1. Turkish language overview
Turkish belongs to the Altaic branch of the Ural-Altaic family of languages. The distinctive characteristics of Turkish are vowel harmony and extensive agglutination, which refers to the process of adding suffixes to a stem. It is possible to express the meaning of an entire English sentence with a single word in Turkish. For example, the English sentence "We were not sleeping" is a single word in Turkish: "sleep" is the stem, and elements meaning "not," "-ing," "we," and "were" are all suffixed to it: "Uyumuyorduk". The Turkish alphabet is derived from the Latin alphabet and consists of 8 vowels (a, e, ı, i, o, ö, u, ü) and 21 consonants (b, c, ç, d, f, g, ğ, h, j, k, l, m, n, p, r, s, ş, t, v, y, z); 7 of these letters (ç, ı, ş, ö, ü, ğ, İ) are modified from their originals in the Latin alphabet.
3.2. Pre-processing
Pre-processing is one of the most important steps in preparing a text dataset for TC. Tokenization, stop-word elimination and stemming are the most widely used pre-processing methods. In general, removal-based pre-processing is conducted first: all common separators, operators, punctuation marks and non-printable characters are removed. Then, stop-word filtering, which aims to filter out the most frequent words, is performed.
Finally, stemming is applied to obtain the stem of a word, i.e. its morphological root, by removing the suffixes that carry grammatical or lexical information about the word. The stemming process is based on the hypothesis that "words with the same stem belong to relatively similar concepts". Since Turkish is an agglutinative language and thousands of different words can be derived from a root word, stemming is an important step before performing text categorization. In the present study, the fixed prefix stemming (FPS) [26] approach and a dictionary-based Turkish stemmer called Zemberek [29] are used. FPS is a pseudo-stemming method: it takes the first n characters of a word and accepts them as the stem. Zemberek is a general-purpose open-source NLP toolkit and includes a suffix dictionary created for stemming.
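The fixed prefix stemming step is simple enough to sketch directly. The snippet below is a minimal illustration of FPS with n = 5 (F5) and n = 7 (F7); the sample tokens are invented for illustration and are not taken from TTC-3600. Zemberek itself is an external toolkit and is not reimplemented here.

```python
# Minimal sketch of fixed prefix stemming (FPS): each token is truncated
# to its first n characters, which serve as the pseudo-stem.

def fps_stem(token, n):
    """Return the first n characters of a token as its pseudo-stem."""
    return token[:n]

def fps_stem_document(tokens, n):
    """Apply fixed prefix stemming to a list of tokens."""
    return [fps_stem(t, n) for t in tokens]

tokens = ["uyumuyorduk", "uyudu", "teknolojik", "spor"]
print(fps_stem_document(tokens, 5))  # ['uyumu', 'uyudu', 'tekno', 'spor']
```

Note that FPS conflates unrelated words that merely share a prefix; the experiments later in the paper measure how much this cheap approximation costs relative to dictionary-based stemming.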
3.3. Feature representation and weighting
Machine learning classifiers generally handle text documents as bag of words (BoW). Vector Space Model (VSM) is an
improved version of BoW, where each text document is represented as a vector, and each dimension corresponds to a
separate term (word) [28]. If a term occurs in the document, then its value becomes non-zero in the vector. When it is
considered from TC perspective, the goal is to construct vectors containing features per category by using a training set
of the documents. In VSM, term weighting is a critical step and three major parts that affect the importance of a term in
a text exist, as follows: the term frequency factor (tf), the inverse document frequency factor (idf) and document length normalization. The normalized weight of a term is computed as illustrated in Equation 1.

w_{ij} = \frac{tfidf(t_i, d_j)}{\sqrt{\sum_{k} tfidf(t_k, d_j)^2}}   (1)

where each tfidf(t_i, d_j) equals tf_{ij} \times idf_i, as in Equation 2.

tfidf(t_i, d_j) = tf_{ij} \times \log\left(\frac{N}{df_i}\right)   (2)

where t_i is the ith term in the document d_j, tf_{ij} is the frequency of word t_i in document d_j, idf_i is the inverse document frequency of word t_i in the dataset, df_i is the number of documents containing the word t_i, and N is the total number of documents in the dataset.
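Equations 1 and 2 can be illustrated with a short sketch. The toy corpus below is invented for the example; the function computes raw tf × idf weights (Equation 2) over a shared vocabulary and then applies document length normalization (Equation 1).

```python
import math

def tfidf_vector(doc_tokens, corpus):
    """Compute the length-normalized tf-idf vector of one document
    (Equations 1 and 2), over the vocabulary of the whole corpus."""
    N = len(corpus)
    vocab = sorted({t for d in corpus for t in d})
    # df_i: number of documents containing term t_i
    df = {t: sum(1 for d in corpus if t in d) for t in vocab}
    # tf_ij * log(N / df_i) for each vocabulary term (Equation 2)
    raw = [doc_tokens.count(t) * math.log(N / df[t]) for t in vocab]
    # document length normalization (Equation 1)
    norm = math.sqrt(sum(w * w for w in raw)) or 1.0
    return [w / norm for w in raw]

corpus = [["spor", "mac"], ["spor", "gol"], ["ekonomi", "borsa"]]
print(tfidf_vector(corpus[0], corpus))
```

In this toy corpus "spor" occurs in two of the three documents, so its idf (and hence its weight) is lower than that of the rarer term "mac", while the normalized vector has unit length.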
3.4. Text categorization and selected classifiers
As a general description, the aim of TC is to classify uncategorized documents into predefined categories. From a machine learning perspective, the aim of TC is to learn classifiers from labeled documents and to perform classification on unlabeled documents. In the literature, there is a rich collection of machine learning classifiers for TC [4]. The selection of the best-performing classifier depends on different parameters such as the number of training examples, the dimensionality of the feature space, feature independence, over-fitting, simplicity and the system's requirements. Considering the high dimensionality and over-fitting characteristics, and the related research conducted on TC, five well-known TC classifiers (NB, SVM, K-NN, J48, and RF) are selected. Detailed information about each selected classifier is given in the following subsections.
3.4.1. Naïve Bayes
Naïve Bayes (NB) classifier is a well-known statistical supervised learning algorithm based on Bayes' Theorem [30].
Conditional probabilities are calculated over the training set to determine the category into which a text document should be classified. Easy implementation and high performance are important advantages of the NB classifier. Furthermore, it requires only a small amount of training data to estimate its parameters, and good results are obtained in most cases. Its main disadvantage is that dependencies between features cannot be modeled. NB is frequently applied in the areas of medical diagnosis, TC, pattern recognition and target marketing, where it gives quite successful results. The basic equation of the NB classifier is illustrated in Equation 3.

P(c_j \mid d) = \frac{P(d \mid c_j)\, P(c_j)}{P(d)}   (3)

where P(c_j \mid d) is the probability of instance d being in class c_j, P(d \mid c_j) is the probability of generating instance d given class c_j, P(c_j) is the probability of occurrence of class c_j, and P(d) is the probability of instance d occurring.
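Equation 3 can be illustrated with a toy example. The priors and conditional word probabilities below are invented numbers, not estimates from TTC-3600; since P(d) is identical for every class, the sketch compares only the unnormalized numerators.

```python
# Toy illustration of Equation 3 with two hypothetical categories.

def naive_bayes_score(doc_tokens, prior, cond_prob):
    """Return the unnormalized P(d|c) * P(c); the denominator P(d) is
    the same for every class, so it can be dropped when comparing."""
    score = prior
    for t in doc_tokens:
        score *= cond_prob.get(t, 1e-6)  # tiny floor for unseen words
    return score

priors = {"spor": 0.5, "ekonomi": 0.5}
cond = {
    "spor":    {"gol": 0.4, "mac": 0.4, "borsa": 0.01},
    "ekonomi": {"gol": 0.01, "mac": 0.05, "borsa": 0.5},
}
doc = ["gol", "mac"]
scores = {c: naive_bayes_score(doc, priors[c], cond[c]) for c in priors}
print(max(scores, key=scores.get))  # 'spor'
```

The multiplication of per-word probabilities is exactly where the independence assumption (the "naïve" part) enters: dependencies between features are ignored.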
3.4.2. Support Vector Machine
Support vector machine (SVM), introduced in 1992, is a classifier based on statistical learning theory and structural risk minimization. The SVM algorithm comes in two forms: linear and nonlinear SVM. In linear SVM, among the infinitely many hyper-planes that can separate the data, the maximum-margin hyper-plane is selected. Nonlinear SVM is used when the classes are not linearly separable; the data is transferred into a higher-dimensional space, in which it becomes linearly separable [31]. The main advantages of SVM are high accuracy and robustness against over-fitting, achieved via structural risk minimization using a regularization parameter. The SVM classifier can also work well with an appropriate kernel even if the data is not linearly separable in the base feature space. Memory-intensive training, difficult interpretation, and the need to determine the regularization and kernel parameters and to choose a kernel are the disadvantages of SVM². The main application areas of the SVM classifier are TC, pattern recognition, bioinformatics and hand-written character recognition.
3.4.3. K-Nearest Neighbor
K-Nearest Neighbor (K-NN), which has no training phase, is an instance-based lazy learning classification algorithm [32]. In this algorithm, the document to be categorized is assigned a category by considering the k closest neighbors among the documents that have known class labels. In K-NN, closeness is defined by a similarity measure such as the Euclidean distance. Equation 4 calculates the Euclidean distance between two instances x_i and x_j.

d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} (x_{ir} - x_{jr})^2}   (4)
The implementation of K-NN is simple, the cost of the learning phase is zero, and it is robust to noisy data. Despite these advantages, K-NN also has some drawbacks. Since no description of the learned concepts exists, a K-NN model cannot be interpreted, and determining the value of the parameter k, which specifies the number of nearest neighbors, is not easy. Finally, finding the k nearest neighbors in high dimensions is computationally expensive.
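The K-NN procedure described above, Euclidean distance (Equation 4) plus a majority vote over the k closest labeled instances, can be sketched as follows; the two-dimensional training vectors and labels are invented for illustration.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Equation 4: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train, query, k=1):
    """Label the query by a majority vote over its k nearest neighbors.
    `train` is a list of (vector, label) pairs."""
    neighbors = sorted(train, key=lambda vl: euclidean(vl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([1.0, 0.0], "spor"), ([0.9, 0.1], "spor"), ([0.0, 1.0], "ekonomi")]
print(knn_predict(train, [0.8, 0.2], k=1))  # 'spor'
```

The full sort over the training set makes the cost of each query visible: there is no training phase, but every prediction scans all stored instances, which is exactly the high-dimensional expense noted above.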
3.4.4. J48 Decision Tree
Decision tree learning is a supervised learning method that determines the category of an input document by building a decision tree over the available training set [33]. In the resulting decision tree, internal nodes represent attributes of the dataset, branches represent attribute values and leaves represent classification labels. The J48 classifier is a Java implementation of the C4.5 algorithm, which uses a divide-and-conquer approach for growing the decision tree. J48 is quite successful in the area of TC in particular and has advantages such as high performance on large datasets and short training times. It builds models that can be easily interpreted and can work with both categorical and continuous values. The main disadvantage of J48 is that a small variation in the training data may lead to a different decision tree.
3.4.5. Random Forest
Random Forest (RF) is an ensemble learning method based on decision trees, proposed by Leo Breiman and Adele Cutler, which grows many classification trees. First, random subspaces of features are selected to construct the branches of the decision trees [34]. Then, a training set is created for each individual tree. Finally, the RF classification model is created by combining all the individual trees. To categorize a document, its input features are passed to every tree in the forest; each tree returns a classification label, and the label with the highest vote is selected as the predicted outcome.
RF is a highly accurate classifier that runs efficiently on large datasets and can handle thousands of input features without any deletion. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing. RF also contains an experimental method for detecting feature interactions. However, the RF classifier may not perform well on a dataset containing categorical variables with varying numbers of levels, because random forests are biased in favor of attributes with more levels.
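The two randomized ingredients described above, a per-tree training set and a majority vote over the trees' predictions, can be sketched as follows. The "tree predictions" are stand-in labels rather than the output of real decision trees, and bootstrap-style sampling with replacement is assumed here as one common way of creating per-tree training sets.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) training instances with replacement for one tree."""
    return [rng.choice(data) for _ in data]

def forest_predict(tree_predictions):
    """Return the label with the highest vote among the trees' outputs."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
print(bootstrap_sample(["d1", "d2", "d3"], rng))
print(forest_predict(["spor", "spor", "ekonomi"]))  # 'spor'
```

Because each tree sees a different resample (and a different random feature subspace in a full RF), the individual trees disagree, and the majority vote averages out their variance.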
3.5. Feature selection
In a text classification approach, if too many features exist in a dataset, over-fitting may result and the accuracy of the classifier will presumably decrease. Besides, as the number of features increases, performing classification can become infeasible because of the lack of computational resources. Consequently, it is important to remove redundant and irrelevant features from the dataset before evaluating machine learning algorithms [33]. Feature selection is an important step for reducing dimensionality and removing irrelevant features. Feature selection methods are categorized as filter-based and wrapper-based methods. Filter-based methods rely on specific characteristics of the training instances to select features, without applying any learning algorithm. Wrapper-based methods, on the other hand, attempt to find features better suited to a pre-defined learning algorithm or classifier. In a classification task with high dimensionality, filter-based methods are usually preferred because of their computational efficiency. Therefore, two well-known filter-based feature selection approaches are utilized in this research, and their details are presented in the remainder of this section.
3.5.1. Correlation-based feature selection (CFS)
The CFS is a filter-based feature selection method used for evaluating subsets of features on the basis of a simple idea: "Good feature subsets contain features that are highly correlated with the classification, but contrarily have low correlation with other features" [35]. Equation 5 calculates the merit of a feature subset S containing k features.

Merit_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}   (5)

where \overline{r_{cf}} is the average value of all feature-classification correlations and \overline{r_{ff}} is the average value of all feature-feature correlations. The CFS method is usually combined with a heuristic search strategy such as best-first search, greedy stepwise search or genetic search.
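Equation 5 is easy to evaluate directly. The sketch below plugs in hypothetical average correlations to show the trade-off the merit score encodes: a smaller, less redundant subset can outscore a larger but highly inter-correlated one.

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """Equation 5: merit of a k-feature subset, given the average
    feature-class correlation r_cf and the average feature-feature
    correlation r_ff."""
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical correlations: ten relevant but mutually redundant features
# score lower than three equally relevant, nearly independent ones.
print(cfs_merit(10, 0.5, 0.9))
print(cfs_merit(3, 0.5, 0.1))
```

The denominator grows with both the subset size and the inter-feature redundancy, which is exactly why the heuristic search can stop adding features even when each one is individually correlated with the class.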
3.5.2. Attribute ranking-based feature selection (ARFS)
The idea behind ARFS is to rank features separately according to their predictive capability for the category and to select the top-ranking ones. One of the most widely used ranking criteria in the area of machine learning is Information Gain (IG) [36]. An information gain score is calculated for each feature over the categories, and the top N ranking features are selected as the feature subset. The IG of a feature f over the classes is calculated by Equation 6 below.

IG(f) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(f)\sum_{i=1}^{m} P(c_i \mid f)\log P(c_i \mid f) + P(\bar{f})\sum_{i=1}^{m} P(c_i \mid \bar{f})\log P(c_i \mid \bar{f})   (6)

where P(c_i) is the proportion of documents in category c_i over the total number of documents, P(c_i \mid f) is the proportion of documents containing the feature f that are in category c_i, and P(f) is the proportion of documents containing the feature f over the total number of documents [37].
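Equation 6 is equivalent to the entropy of the class distribution minus the expected class entropy after observing whether the feature is present, which the sketch below computes (using base-2 logarithms; all probabilities are hypothetical).

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(p_c, p_f, p_c_given_f, p_c_given_not_f):
    """Equation 6 for a binary feature f: H(C) - P(f) H(C|f) - P(not f) H(C|not f).
    p_c, p_c_given_f and p_c_given_not_f are per-class probability lists."""
    return (entropy(p_c)
            - p_f * entropy(p_c_given_f)
            - (1 - p_f) * entropy(p_c_given_not_f))

# A perfectly predictive feature over two balanced classes gains 1 bit.
print(information_gain([0.5, 0.5], 0.5, [1.0, 0.0], [0.0, 1.0]))  # 1.0
```

A feature whose presence leaves the class distribution unchanged scores zero, which is why setting the Ranker threshold to zero (as described in Section 4.1) discards exactly the uninformative features.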
4. Experimental study
In this section, we present detailed information about the experimental procedure applied, the TTC-3600 dataset created, the performance evaluation criteria considered and the experimental results obtained.
4.1. Experimental method
The experiments are performed using the implementations of the NB, J48, RF, SVM, and K-NN classifiers in WEKA (Waikato Environment for Knowledge Analysis) version 3.6.12 [38]. In this study, the default parameters are used for each WEKA classifier and feature selection method, since these parameters give promising experimental results [2]. For NB with continuous variables, no kernel method is used for estimating the distribution. For SVM, a non-linear kernel of degree 3 with WEKA's default settings is utilized. The default RF parameters 100 and 1 are selected, where the first number is the number of trees and the second is the random number seed used for each tree. Furthermore, the default K-NN and J48 classifier parameters are employed in the research. For K-NN, the value of the parameter k is set to 1, distance weighting is not applied and the Euclidean distance is selected as the distance function.
Each classifier is tested with 10-fold cross-validation, which is a common strategy for estimating classifier performance. In this strategy, the dataset is split into 10 blocks: a single block is retained as validation data for testing the model, and the remaining 9 blocks are used as training data. The process is then repeated 10 times so that each block serves once as the test set.
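The 10-fold procedure can be sketched in a few lines of Python (illustrative only; the experiments themselves used WEKA's built-in cross-validation, and the 100-item dataset here is a stand-in):

```python
import random

def cross_validate(data, k=10, seed=42):
    """10-fold cross-validation sketch: shuffle indices, split them into k
    blocks, and let each block serve once as the test fold while the
    remaining k - 1 blocks form the training set."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal blocks
    for held_out in range(k):
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, folds[held_out]

data = list(range(100))  # stand-in for the 3,600 TTC-3600 documents
splits = list(cross_validate(data))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```

Reported scores are then averaged over the 10 train/test splits, so every document contributes exactly once to testing.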
In this study, two FS methods, correlation-based (CFS) and attribute ranking-based (ARFS), are used in order to evaluate the performance of feature selection applied to the TTC-3600 dataset. For CFS, the CfsSubsetEval evaluator of the WEKA data mining tool with the BestFirst search strategy is used to select the best feature subset. For ARFS, the InfoGainAttributeEval evaluator with the Ranker search method is utilized to rank features according to their information gain scores. Instead of empirically selecting the N highest-ranking features, all features with an information gain score greater than zero are selected by setting the threshold parameter of the Ranker search method to zero.
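Once IG scores are available, the selection step is simple. The sketch below mirrors (but does not reproduce) the behaviour of WEKA's InfoGainAttributeEval evaluator combined with the Ranker method at threshold 0; the example scores are hypothetical:

```python
def arfs_select(ig_scores, threshold=0.0):
    """Rank features by information-gain score in descending order and keep
    every feature scoring above the threshold -- ARFS with threshold 0
    keeps all features that carry any information about the category."""
    ranked = sorted(ig_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [feature for feature, score in ranked if score > threshold]

# hypothetical IG scores: stop-word-like terms score zero and are dropped
scores = {"gol": 0.83, "seçim": 0.61, "ve": 0.0, "bir": 0.0}
print(arfs_select(scores))  # ['gol', 'seçim']
```

With a zero threshold, the retained subset size is data-driven rather than fixed in advance, which is how the 1,684/942/1,241/1,551 feature counts reported later arise.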
4.2. TTC-3600 dataset
Since the datasets used in other studies are either not accessible or were created for different purposes, a new dataset called TTC-3600 is created. The most important feature of this dataset, which can be widely used in TC studies on Turkish news and articles, is that it is simple to use and well documented. The dataset consists of a total of 3,600 documents: 600 news texts in each of 6 categories (economy, culture-arts, health, politics, sports and technology), obtained from 6 well-known news portals and agencies (Hürriyet3, Posta4, İha5, HaberTürk6, Radikal7 and Zaman8).
Documents of the TTC-3600 dataset are collected between May and July 2015 via Rich Site Summary (RSS) feeds from the 6 corresponding categories of the respective portals. A special RSS feeder, which can collect XML-format RSS feeds from any portal, is developed in the C# programming language on the Visual Studio 2013 IDE to fetch the feeds. In the study, the <title> and <description> XML elements of the RSS feeds are taken into consideration for text categorization. Since these items contain data that is unnecessary for TC, removal-based pre-processing is conducted: all JavaScript code, HTML tags (<img>, <a>, <p>, <strong> etc.), operators, punctuation, non-printable characters and irrelevant data such as advertising are removed.
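The removal-based pre-processing described above can be approximated with a few regular expressions. This is an illustrative Python sketch, not the authors' C# RSS feeder; the sample RSS item is invented:

```python
import re

def clean_rss_item(html_text):
    """Removal-based pre-processing sketch for an RSS <title>/<description>:
    strip scripts, HTML tags, punctuation/operators, and collapse whitespace."""
    text = re.sub(r"<script.*?</script>", " ", html_text, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)        # <img>, <a>, <p>, <strong> ...
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation and operators
    return re.sub(r"\s+", " ", text).strip()    # non-printables, extra spaces

item = "<p>Galatasaray <strong>3-0</strong> kazandı!</p><script>ads();</script>"
print(clean_rss_item(item))  # Galatasaray 3 0 kazandı
```

Note that in Python 3 the `\w` class already covers Turkish letters such as ç, ğ, ı, ö, ş and ü, so accented words survive the punctuation filter intact.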
Accepted for Publication
By the Journal of Information Science: http://jis.sagepub.co.uk
Kılınç et al. 8
Three additional dataset versions are created from TTC-3600 by applying different stemming methods. In all versions, the removal-based pre-processing explained in detail in Section 3.2 is applied first. Then Turkish stop-words, which have no discriminatory power for TC (pronouns, prepositions, conjunctions etc.), are removed from all datasets except the original one. In this study, a semi-automatically constructed stop-word list [26] containing 147 words is utilized.
After pre-processing is completed, the text documents (containing stems) in all dataset versions are transformed into document-term matrices by the text2arff tool [39], a feature extraction software, using the tf × idf weighting scheme. Each matrix is then converted into the attribute-relation file format (ARFF), which is the input format required by WEKA.
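The transformation into a weighted document-term matrix and its ARFF serialization can be sketched as follows. This is a simplified stand-in for the text2arff tool (not its actual implementation), with invented toy documents:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a document-term matrix with tf x idf weights: term frequency
    in the document times log(N / document frequency of the term)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))       # document frequency
    idf = {t: math.log(n / df[t]) for t in vocab}
    return vocab, [[Counter(d)[t] * idf[t] for t in vocab] for d in docs]

def to_arff(vocab, matrix, labels, relation="TTC-3600"):
    """Serialize the matrix in WEKA's attribute-relation file format (ARFF)."""
    lines = ["@RELATION " + relation, ""]
    lines += ["@ATTRIBUTE " + t + " NUMERIC" for t in vocab]
    lines.append("@ATTRIBUTE class {" + ",".join(sorted(set(labels))) + "}")
    lines += ["", "@DATA"]
    for row, label in zip(matrix, labels):
        lines.append(",".join("%.4f" % v for v in row) + "," + label)
    return "\n".join(lines)

# toy stemmed documents standing in for pre-processed TTC-3600 texts
docs = [["gol", "maç"], ["seçim", "oy"], ["gol", "takım"]]
labels = ["spor", "siyaset", "spor"]
vocab, matrix = tfidf_matrix(docs)
arff = to_arff(vocab, matrix, labels)
print(arff.splitlines()[0])  # @RELATION TTC-3600
```

Each row of the `@DATA` section is one document: its weighted term vector followed by its category label.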
Table 1 gives information about the TTC-3600 dataset versions. In the F5-DS and F7-DS datasets, stemming is performed using the FPS approach, where the first 5 and 7 characters of each word are selected as the stem, respectively. In the Zemb-DS dataset, the Zemberek NLP toolkit is used as the stemmer. The Original-DS, F5-DS, F7-DS and Zemb-DS datasets contain 7,508, 3,209, 4,814 and 5,693 words (features), respectively.
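The FPS approach amounts to truncating each word to a fixed-length prefix; a one-line sketch follows (Zemberek's full morphological analysis, used for Zemb-DS, is not reproduced here). The example words are illustrative:

```python
def fps_stem(word, n):
    """Fixed prefix stemming (FPS): keep only the first n characters as the
    stem, as in the F5-DS (n = 5) and F7-DS (n = 7) dataset versions;
    words shorter than n are kept whole."""
    return word[:n]

# inflected forms of the same Turkish root collapse to one pseudo-stem
words = ["teknolojik", "teknolojide", "teknoloji"]
print([fps_stem(w, 5) for w in words])  # ['tekno', 'tekno', 'tekno']
```

Collapsing inflected forms this way is what shrinks the vocabulary from 7,508 features to 3,209 (n = 5) or 4,814 (n = 7).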
Table 1. TTC-3600 dataset versions

No | Dataset name | Stop-words filtering | Stemmer    | Number of documents | Number of features
1  | Original-DS  | No                   | No-Stemmer | 3,600               | 7,508
2  | F5-DS        | Yes                  | FPS-5      | 3,600               | 3,209
3  | F7-DS        | Yes                  | FPS-7      | 3,600               | 4,814
4  | Zemb-DS      | Yes                  | Zemberek   | 3,600               | 5,693
The dataset and its files are publicly available so that experimental evaluations on TTC-3600 can be reproduced9. Each version of the TTC-3600 dataset includes two types of files in addition to the original pre-processed text files: the first, with a ".txt" extension, contains the names and ids of the features, whereas the second, in ARFF format, describes the list of instances sharing that set of features.
4.3. Evaluation criteria
In the machine learning domain, different evaluation criteria are used to evaluate classifiers. All of them are derived from a confusion matrix [40], which contains actual and predicted classification information. True positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) denote the four possible prediction outcomes. In this study, the most widely accepted evaluation criterion, ACC, is utilized. Each criterion is described in the following.
Accuracy (ACC) is the most widely used performance evaluation criterion; it is the ratio of correctly classified documents to the total number of documents. It is calculated using Equation 7 given below.
ACC = (TP + TN) / (TP + TN + FP + FN)    (7)

Precision is the proportion of documents assigned to a category that actually belong to it, whereas Recall is the proportion of documents belonging to a category that are correctly assigned to it. Precision and Recall are calculated using Equations 8 and 9, respectively.

Precision = TP / (TP + FP)    (8)

Recall = TP / (TP + FN)    (9)
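Equations 7-9 translate directly into code. In the sketch below, the confusion-matrix counts are hypothetical, chosen only to demonstrate the computation:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, Precision and Recall (equations 7-9) computed from the
    four confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return acc, precision, recall

# hypothetical counts for one category of a 3,600-document collection
acc, precision, recall = metrics(tp=540, tn=2940, fp=60, fn=60)
print("ACC=%.4f Precision=%.4f Recall=%.4f" % (acc, precision, recall))
```

Note that ACC counts both positive and negative decisions, whereas Precision and Recall focus on the positive class only; for multi-class TC these are typically computed per category and averaged.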
4.4. Experimental results and discussion
Figure 1 presents the ACC results of all classifiers on the TTC-3600 dataset versions. The aim of these experiments is to evaluate the performance of TC classifiers on dataset versions created using different stemming methods. Considering the experimental results, RF is the most accurate classifier in terms of ACC; it achieves the highest ACC value on all datasets regardless of stemming.
ACC values obtained by this classifier are 88.6%, 87.9%, 88.3% and 90.1% for the Original-DS, F5-DS, F7-DS and Zemb-DS datasets, respectively.
Figure 1. Experimental ACC percentage results of classifiers on datasets.
On the other hand, K-NN has the lowest ACC values among all classifiers, with results below 60%. The results closest to RF are achieved by SVM (except on Zemb-DS), which is a kernel-based classifier. The NB classifier gives more accurate results than SVM on the Zemb-DS dataset.
According to the data presented in Figure 1, there is at most a 3% ACC difference between the results of the classifiers on the Original-DS dataset and their results on the three datasets created after stemming. Considering that the Original-DS, F5-DS, F7-DS and Zemb-DS datasets have 7,508, 3,209, 4,814 and 5,693 features, respectively, the number of features drops dramatically, yet the effect of this reduction on ACC is at most 3%. This indicates that pre-processing applied to Turkish texts before TC has only a limited effect on accuracy.
Evaluating the success of the stemming methods in general, the classifier results obtained on the F5-DS and F7-DS datasets, which are stemmed with the FPS approach, are worse than those obtained on the original dataset (except for K-NN). The classifier results on the Zemb-DS dataset, created using the Zemberek NLP toolkit, are better than those on the original dataset (except for SVM). As a result, across all TTC-3600 datasets, stemming performed with the Zemberek stemmer outperforms all other methods.
Table 2. The number of remaining features after feature selection methods.

#           | Without FS | CFS | ARFS
Original-DS | 7,508      | 55  | 1,684
F5-DS       | 3,209      | 35  | 942
F7-DS       | 4,818      | 63  | 1,241
Zemb-DS     | 5,693      | 52  | 1,551
After the CFS and ARFS methods are applied in order to observe the effect of FS, the remaining numbers of features for each dataset are presented in Table 2. As a result of the CFS method, which is combined with the heuristic best-first search strategy, about 85% to 90% of the features in the datasets are eliminated because they are found to be irrelevant.
On the other hand, the number of features remaining after the ARFS method is considerably greater than the number obtained by CFS. For example, the F7-DS dataset initially contains 4,818 features, whereas 63 features remain after CFS and 1,241 after ARFS.
Table 3. The effect of FS methods on the experimental results.

     | ACC without FS            | ACC of CFS                | ACC of ARFS
     | OrgDS ZembDS F5DS  F7DS   | OrgDS ZembDS F5DS  F7DS   | OrgDS ZembDS F5DS  F7DS
NB   | 82.94 87.17  82.22 84.03  | 78.97 80.44  75.25 78.56  | 82.94 87.19  82.22 84.06
J48  | 78.06 79.00  77.14 75.50  | 76.72 78.19  71.67 74.78  | 78.97 79.39  77.36 75.97
RF   | 88.53 90.10  87.92 88.25  | 80.17 81.42  75.44 78.67  | 88.87 91.03  88.28 88.59
SVM  | 86.03 84.97  82.39 83.56  | 69.31 69.61  68.17 69.19  | 79.53 76.86  74.97 76.92
KNN  | 52.83 54.00  55.11 52.67  | 73.11 74.97  69.44 72.56  | 64.44 65.25  64.33 62.56
Table 3 shows the performance comparison of the feature selection methods in terms of ACC on the four datasets. As can be seen from Table 3, the ACC performance of all classifiers except K-NN on all datasets is reduced after applying the CFS method. For example, the ACC values of the NB classifier on OrgDS, ZembDS, F5DS and F7DS were 82.94, 87.17, 82.22 and 84.03, respectively, before the process, but drop to 78.97, 80.44, 75.25 and 78.56 after CFS. A similar decrease is observed for the RF classifier before and after CFS. For J48, since its ACC values are already low, only a minimal decrease occurs after CFS.
One of the largest performance decreases is observed for the non-linear kernel-based SVM classifier: when SVM is run after applying CFS, its ACC values drop by around 12%-15%. The SVM classifier, one of the state-of-the-art algorithms of today, addresses the problem of over-fitting via structural risk minimization, which provides regularization. Any intervention such as discretization or feature selection can invalidate its performance bounds and potentially undermine the structural risk minimization principle. Reconsidering the results given in Figure 1, SVM is the only classifier whose ACC value does not increase on the Zemb-DS dataset. In short, SVM has a fairly robust algorithmic design against uninformative features and produces better results when no selection or reduction is performed.
After performing CFS, the only classifier with significantly increased performance is K-NN. For example, its ACC value on the OrgDS dataset increases by about 21%, from 52.83% to 73.11%. The main reason is that the K-NN algorithm is directly affected by the phenomenon known in the literature as the "curse of dimensionality" [41] in high-dimensional settings. More specifically, in high dimensions, Euclidean distance becomes ineffective since all vectors are almost equidistant from the query vector10. Since the number of features decreases significantly after CFS, the K-NN algorithm performs much more accurately than it does without feature selection.
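This distance-concentration effect can be demonstrated numerically. The following toy Monte Carlo sketch (dimensions, point count and seed are arbitrary choices, unrelated to TTC-3600) compares the relative spread of Euclidean distances from a random query point in low- and high-dimensional unit cubes:

```python
import math
import random

def distance_spread(dim, n_points=200, seed=7):
    """Relative spread (max - min) / min of Euclidean distances from a random
    query to n_points random points in the dim-dimensional unit cube. As dim
    grows, the spread shrinks: all points become nearly equidistant, which is
    the effect that degrades K-NN before feature selection."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.sqrt(sum((q - rng.random()) ** 2 for q in query))
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# low-dimensional distances vary widely; high-dimensional ones concentrate
print(distance_spread(2) > distance_spread(1000))  # True
```

With only a few dozen features retained after CFS, nearest and farthest neighbours are well separated again, which is consistent with the observed K-NN improvement.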
Consequently, since around 85-90% of the features in the TTC-3600 datasets are eliminated by CFS, including some features that are discriminative for the categories, the ACC values of all classifiers except K-NN decrease. In addition, it is observed that applying feature selection before the SVM classifier, a non-linear kernel-based classifier that works well in high-dimensional environments, reduces its accuracy.
In addition to the CFS results, considering the ARFS results given in Table 3, the NB, J48 and RF classifiers obtain ACC results after ARFS that are similar to or better than those obtained on the original datasets. For example, the best ACC value in this study (91.03%) is obtained by the RF classifier on the ZembDS dataset after applying ARFS. Moreover, the ACC values obtained by these 3 classifiers after ARFS are higher than those obtained with CFS.
After performing ARFS, the SVM classifier performs 7-8% worse in terms of ACC than on the original datasets; on the other hand, it gives much more accurate results and higher ACC values than with CFS. This result on the TTC-3600 dataset is not surprising given the high performance of SVM, a non-linear kernel-based classifier, in high-dimensional environments: the number of features remaining after ARFS is much larger than after CFS, and SVM performs better in high-dimensional settings.
The ACC values of the K-NN classifier after ARFS are about 8-12% higher than its values on the original datasets; however, they are lower than its values with CFS. It can be concluded from the feature selection experiments that the performance of ARFS is superior to CFS for all classifiers except K-NN, which achieves notable success with a 74.97% ACC value on the ZembDS dataset using only the 52 features remaining after CFS. Accordingly, it can be speculated that even though the ACC results of the K-NN classifier degrade in high-dimensional environments, it can be promising when run with a small number of features, in other words, when dimensionality reduction methods are applied.
Finally, the RF classifier is the most accurate across all stemming steps (F5, F7, Zemberek) and feature selection methods (CFS, ARFS) applied to the TTC-3600 dataset, and the best ACC result is obtained on the ZembDS dataset after applying ARFS.
5. Conclusion and future works
In this study, extensive experiments on Turkish TC, for which work is very limited compared to other languages, are carried out, and all accessible studies in the literature are discussed. A new dataset called TTC-3600, which can be widely used in TC studies on Turkish news and articles, is created by collecting news from six well-known news portals and agencies in Turkey, and it has been made publicly available for use in comparative experiments by other researchers. Three pre-processed versions of the TTC-3600 dataset (with stemming, stop-word elimination etc.) are also created in addition to the original dataset and used in the experiments of this study. Detailed information about the TTC-3600 dataset is presented in Section 4.2.
Five well-known classifiers within the field of TC, NB, SVM, K-NN, J48 and RF, are evaluated on the TTC-3600 dataset. In addition, the CFS and ARFS feature selection methods are utilized in order to observe the impact of feature selection on Turkish TC. The experimental results indicate that, in all comparisons performed after the pre-processing and feature selection steps, the RF classifier gives the most accurate results, and the best ACC value of 91.03% is obtained on the Zemb-DS dataset version after applying ARFS.
In future studies, other TC classifiers, ensemble learning methods, different feature selection approaches and n-gram based dimensionality reduction methods can be used to investigate the TTC-3600 dataset in more detail. Another direction for future work is constructing a new, larger dataset by collecting many more documents and investigating horizontally scalable TC using a framework such as Hadoop MapReduce [42].
Notes
1. https://en.wikipedia.org/wiki/Unstructured_data.
2. http://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf.
3. http://dosyalar.hurriyet.com.tr/rss.
4. http://www.posta.com.tr/rss.
5. http://www.iha.com.tr/rss.html.
6. http://www.haberturk.com/rss.
7. http://www.radikal.com.tr/rss.
8. http://www.zaman.com.tr/rss_rssMainPage.action?sectionId=341.
9. https://github.com/GitCBU/TTC-3600.
10. https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
References
[1] Chen SY and Liu X. The contribution of data mining to information science. Journal of Information Science 2004; 30(6): 550-558.
[2] Amancio DR, et al. A systematic comparison of supervised classifiers. PLoS ONE 2014; 9(4): e94137.
[3] Michie D, Spiegelhalter DJ and Taylor CC. Machine learning, neural and statistical classification. USA: Ellis Horwood
Limited, 1994.
[4] Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys 2002; 34(1): 1-47.
[5] Jieming Y, Zhaoyang Q and Liu Z. Improved feature-selection method considering the imbalance problem in text categorization. The Scientific World Journal 2014.
[6] Onan A. Classifier and feature set ensembles for web page classification. Journal of Information Science 2015,
10.1177/0165551515591724.
[7] Wolpert DH and Macready WG. No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute, 1995.
[8] Read J. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In: Proceedings of the ACL student research workshop, 2005, pp. 43-48.
[9] Zhang P and He Z. Using data-driven feature enrichment of text representation and ensemble technique for sentence-level
polarity classification. Journal of Information Science 2015, 10.1177/0165551515585264.
[10] Cavnar WB and Trenkle JM. N-gram based text categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994, pp. 161-175.
[11] Ismail H, et al. Automatic Arabic text categorization: A comprehensive comparative study. Journal of Information Science
2015; 41(1): 114-124.
[12] Shaalan K and Oudah M. A hybrid approach to Arabic named entity recognition. Journal of Information Science 2014; 40(1):
67-87.
[13] Al-Radaideh QA, AlEroud AF and Al-Shawakfa EM. A hybrid approach to detecting alerts in Arabic e-mail messages. Journal of Information Science 2012; 38(1): 87-99.
[14] Güran A, Akyokuş S, Güler N and Gürbüz Z. Turkish text categorization using n-gram words. In: Proceedings of the International Symposium on Innovations in Intelligent Systems and Applications (INISTA), 2009, pp. 369-373.
[15] Torunoğlu D, Çakırman E, Ganiz MC et al. Analysis of preprocessing methods on classification of Turkish texts. In:
Proceedings of International Symposium on Innovations in Intelligent Systems and Applications, 2011, pp. 112-118.
[16] Akkus BK and Cakici R. Categorization of Turkish news documents with morphological analysis. In: Proceedings of the ACL student research workshop, 2013, pp. 1-8.
[17] Amasyalı MF and Beken A. Measurement of Turkish word semantic similarity and text categorization application. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference, Antalya, Turkey, 9-11 April 2009. New York: IEEE, pp. 1-4.
[18] Amasyali MF, Diri B. Automatic Turkish text categorization in terms of author, genre and gender. In: Natural Language
Processing and Information Systems, Springer Berlin Heidelberg, 2006, pp. 221-226.
[19] Tufekci P and Uzun E. Author detection by using different term weighting schemes. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference (SIU), Trabzon, Turkey, 24-26 April 2013. New York: IEEE, pp. 1-4.
[20] Çataltepe Z, Turan Y and Kesgin F. Turkish document classification using shorter roots. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference (SIU), Eskisehir, Turkey, 11-13 June 2007. New York: IEEE, pp. 1-4.
[21] Alparslan E, Karahoca A and Bahşi H. Classification of confidential documents by using adaptive neurofuzzy inference
systems. Procedia Computer Science 2011; 3: 1412-1417.
[22] Uysal AK and Gunal S. The impact of preprocessing on text classification. Information Processing and Management 2014; 50:
104-112.
[23] Gunal S. Hybrid feature selection for text classification. Turkish Journal of Electrical Engineering and Computer Sciences 2012; 20: 1296-1311.
[24] Özalp N, Yılmaz G and Ayan U. Novel comment filtering approach based on outlier on streaming data. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference (SIU), Mugla, Turkey, 18-20 April 2012. New York: IEEE, pp. 1-4.
[25] Özgür L, Güngör T and Gürgen F. Adaptive anti-spam filtering for agglutinative languages. Pattern Recognition Letters 2004; 25(16): 1819-1831.
[26] Can F, Kocberber S, Balcik E, Kaynak C, Ocalan HC and Vursavas OM. Information retrieval on Turkish texts. Journal of the American Society for Information Science and Technology 2008; 59(3): 407-421.
[27] Kılıçaslan Y, Güner ES and Yıldırım S. Learning-based pronoun resolution for Turkish with a comparative evaluation.
Computer Speech and Language 2009; 23: 311-331.
[28] Salton G and Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing & Management 1988; 24(5): 513-523.
[29] Akin AA and Akin MD. Zemberek, an open source NLP framework for Turkic Languages, 2007.
[30] Yildirim P and Birant D. Naive Bayes classifier for continuous variables using novel method (NBC4D) and distributions. In: Proceedings of the IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA), Alberobello, Italy, 23-25 June 2014. New York: IEEE, pp. 110-115.
[31] Sebastiani F. Text categorization. Text Mining and Its Applications 2005, pp. 109-129.
[32] Aha DW, Kibler D and Albert MK. Instance-based Learning Algorithms. Machine Learning 1991; 6(1): 37-66.
[33] Quinlan JR. C4.5: Programs for Machine Learning. Machine Learning 1993; 16(3): 235-240.
[34] Xu B, Guo X, Ye Y and Cheng J. An Improved Random Forest Classifier for Text Categorization. Journal of Computers 2012;
7(12): 2913-2920.
[35] Hall M. Correlation-based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, New Zealand, 1999.
[36] Yang Y and Pedersen JO. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning (ICML '97), Nashville, TN, USA, 1997. Morgan Kaufmann, pp. 412-420.
[37] Youn E and Jeong MK. Class dependent feature scaling method using naive Bayes classifier for text datamining. Pattern Recognition Letters 2009; 30(5): 477-485.
[38] Witten IH and Frank E. Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, CA: Morgan Kaufmann, 2005.
[39] Amasyali M, et al. Text2arff: automatic feature extraction software for Turkish texts. In: Proceedings of the IEEE Signal Processing and Communications Applications Conference (SIU), Diyarbakir, Turkey, 22-24 April 2010. New York: IEEE, pp. 629-632.
[40] Kohavi R and Provost F. On applied research in machine learning. Machine Learning 1998; 30(2-3): 127-132.
[41] Beyer K, Goldstein J, Ramakrishnan R and Shaft U. When is "nearest neighbor" meaningful? In: Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel, 10-12 January 1999, pp. 217-235.
[42] Meijing L, Xiuming Y and Ryu KH. MapReduce-based web mining for prediction of web-user navigation. Journal of
Information Science 2014; 40(5): 557-567.
... During the experimental phase of the study, we evaluate several models combining two paragraph vector architectures with ad-hoc retrieval. The experiments performed on a well-known Turkish news collection [9] show that the proposed approach can reach up to 93.5% classification accuracy, which results in more accurate predictions than the baseline benchmark methods provided by Kılınç et al. [9]. Furthermore, our results are highly close and comparable to the current state-of-the-art [6]. ...
... During the experimental phase of the study, we evaluate several models combining two paragraph vector architectures with ad-hoc retrieval. The experiments performed on a well-known Turkish news collection [9] show that the proposed approach can reach up to 93.5% classification accuracy, which results in more accurate predictions than the baseline benchmark methods provided by Kılınç et al. [9]. Furthermore, our results are highly close and comparable to the current state-of-the-art [6]. ...
... In this study, we use the public TTC-3600 dataset [9] to develop and evaluate models for Turkish news categorization. As its name implies, the collection consists of 3600 Turkish news articles and their corresponding categories. ...
Article
News categorization, which is a common application area of text classification, is the task of automatic annotation of news articles with predefined categories. In parallel with the rise of deep learning techniques in the field of machine learning, neural embedding models have been widely utilized to capture hidden relationships and similarities among textual representations of news articles. In this study, we approach the Turkish news categorization problem as an ad-hoc retrieval task and investigate the effectiveness of paragraph vector models to compute and utilize document-wise similarities of Turkish news articles. We propose an ensemble categorization approach that consists of three main stages, namely, document processing, paragraph vector learning, and document similarity estimation. Extensive experiments conducted on the TTC-3600 dataset reveal that the proposed system can reach up to 93.5% classification accuracy, which is a remarkable performance when compared to the baseline and state-of-the-art methods. Moreover, it is also shown that the Distributed Bag of Words version of Paragraph Vectors performs better than the Distributed Memory Model of Paragraph Vectors in terms of both accuracy and computational performance.
... Information about the studies examined in the literature is shown in Table 1. [20]. ...
... Since text data obtained from news websites is more formal and likely to yield more accurate results in finding entity relationships, it was selected as the training data for the proposed model. The Turkish TTC-4900 dataset [20], which is commonly used in the literature for text classification tasks, was used for model training. Additionally, the widely used 20Newsgroups dataset [38] and the BBC dataset [39] in the literature for English text classification tasks were used for model performance comparison. ...
... TVQ [32,68] evaluates t j using score function given by Formula (53). ...
... The CNAE dataset [29,30] is made up of 1080 business description documents of nine types of Brazilian Companies. The KDC dataset [77,88,89] and the TTC dataset [53] respectively contain 4007 and 3600 documents of Turkish news and articles. ...
Article
Full-text available
Filter feature selection methods are utilized to select discriminative terms from high-dimensional text data to improve text classification performance and reduce computational costs. This paper aims to provide a comprehensive systematic review of existing filter feature selection methods for text classification. Firstly, we briefly discuss text classification based on filter feature selection. Secondly, we present a detailed discussion on mathematical designs, effectiveness and complexity of existing filter feature selection methods of different methodologies (supervised methods, unsupervised methods and hybrid methods). In addition, a certain number of benchmark datasets for evaluating performance of filter feature selection methods in text classification are also discussion. Finally, we provide future directions in filter feature selection, along with conclusion.
... One investigation into the categorization of Turkish news data, Kılınç et al. [10] created a new dataset called TTC-3600 that may be extensively used in TC research of Turkish news and article content. On TTC-3600, different successful classifiers in the TC domain and successful feature selection methods are evaluated. ...
... The preprocessing tasks for the multiclass classification problem are evaluated using the news datasets. The first dataset is TTC-3600 [10]. Being userfriendly and well-documented is the most crucial aspect of this dataset, which may be extensively employed in TC studies pertaining to Turkish news and articles. ...
Article
In a standard text classification (TC) study, preprocessing is one of the key components to improve performance. This study aims to look at how preprocessing effects TC according to news text, text language, and feature selection. All potential combinations of commonly used preprocessing techniques are compared on one domain, namely news data, and in two different news datasets for this aim. Preprocessing technique contributions to classification performance at multiple feature sizes, possible interconnections among these techniques, and technique dependency on corresponding languages are all evaluated in this way. Using best combinations of preprocessing techniques rather than using or not using them all, experimental studies on public datasets reveals that, choosing best combinations of preprocessing techniques can improve classification accuracy significantly.
... We also test our model on other languages newspaper datasets. The BERT model which produces best results in terms of our dataset, we applied BERT model on the TTC-3600 dataset of the Turkish language newspaper provided by Kilinc et al. (2017) and the accuracy was 92.85%. Aian, we applied our best BERT model on other language newspaper provided by Dogru et al. (2021). ...
Article
The rapid increase in obtainable online text data has made text categorization an important tool for data analysts to extract relevant information on the web. However, incorrect or incomplete classification of marginalized groups may result from using biased text data. In order to remedy the disparity in available data, this research suggests a system for classifying and analyzing Bangla news articles. The suggested approach first uses both Random Under-Sampling (RUS) and Synthetic Minority Oversampling Techniques to balance the massive unbalanced Bangla News dataset consisting of 4,37,948 instances (SMOTE). Secondly, the proposed system employs three machine learning models: Logistic Regression, Decision Tree, and Stochastic Gradient Descent along with three deep learning models: Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Bidirectional Encoder Representations from Transformers (BERT) for Bangla text categorization. The experimental results signify the superior performance of BERT to other classification models of the system as well as other existing methods in this domain. The proposed system achieves the maximum accuracy of 99.04% in balanced dataset and 72.23% in imbalanced dataset using BERT. K-fold cross validation with varied K values is used to determine the performance consistency of BERT. Finally, both LIME (Local Interpretable Model agnostic Explanations and SHAP (SHapley Additive exPlanations) techniques are applied for interpreting each prediction made by BERT.
Article
Full-text available
The selection of discriminative terms from the large set of terms in text documents helps achieve better text classification accuracy. To address the task of selecting discriminative terms from text, a deep learning based feature selection method is proposed. The method is built on the long short-term memory (LSTM) network: a deep LSTM-based network is trained in an unsupervised manner to extract deep features from bag-of-words term frequency vectors. These deep features are integrated with term frequencies to evaluate the effectiveness of terms, extending feature selection beyond the limits of term frequency information alone. Experiments on nine public datasets demonstrate that our method selects discriminative terms better than comparative methods.
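The bag-of-words term-frequency vectors the method starts from, and the idea of ranking terms by an effectiveness score to keep the top k, can be illustrated as follows. The raw-frequency score here is a deliberately simple stand-in for the paper's LSTM-derived deep-feature score, and the corpus is hypothetical.

```python
from collections import Counter

def term_frequency_vectors(corpus):
    """Build the bag-of-words term-frequency vectors the method takes as input."""
    vocab = sorted({t for doc in corpus for t in doc.split()})
    vectors = []
    for doc in corpus:
        counts = Counter(doc.split())
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors

def select_terms(vocab, vectors, k):
    """Rank terms by a score and keep the top k; total corpus frequency is
    used here as a simple stand-in for the learned effectiveness score."""
    totals = [sum(v[i] for v in vectors) for i in range(len(vocab))]
    ranked = sorted(zip(vocab, totals), key=lambda p: -p[1])
    return [t for t, _ in ranked[:k]]

corpus = ["match goal goal", "goal score", "vote vote election"]
vocab, vecs = term_frequency_vectors(corpus)
print(select_terms(vocab, vecs, 2))  # ['goal', 'vote']
```

In the actual method, the stand-in score would be replaced by a combination of term frequency and the deep features extracted by the unsupervised LSTM network.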
Article
Full-text available
Since Turkish is an agglutinative language and contains reduplication, idiom, and metaphor words, Turkish texts are sources of information with extremely rich meanings. For this reason, the processing and classification of Turkish texts according to their characteristics is both time-consuming and difficult. In this study, the performances of pre-trained language models for multi-text classification using Autotrain were compared in a 250 K Turkish dataset that we created. The results showed that the BERTurk (uncased, 128 k) language model on the dataset showed higher accuracy performance with a training time of 66 min compared to the other models and the CO2 emission was quite low. The ConvBERTurk mC4 (uncased) model is also the best-performing second language model. As a result of this study, we have provided a deeper understanding of the capabilities of pre-trained language models for Turkish on machine learning.
Article
Full-text available
Web page classification is an important research direction in web mining. The abundant amount of data available on the web makes it essential to develop efficient and robust models for web mining tasks. Web page classification is the process of assigning a web page to a particular predefined category based on labelled data. It serves several other web mining tasks, such as focused web crawling, web link analysis and contextual advertising. Machine learning and data mining methods have been successfully applied to several web mining tasks, including web page classification. Multiple classifier systems are a promising research direction in machine learning, which aims to combine several classifiers by differentiating base classifiers and/or dataset distributions so that more robust classification models can be built. This paper presents a comparative analysis of four different feature selection methods (correlation-, consistency-, information gain- and chi-square-based feature selection) and four different ensemble learning methods (Boosting, Bagging, Dagging and Random Subspace) based on four different base learners (naive Bayes, the K-nearest neighbour algorithm, the C4.5 algorithm and the FURIA algorithm). The article examines the predictive performance of ensemble methods for web page classification. The experimental results indicate that feature selection and ensemble learning can enhance the predictive performance of classifiers in web page classification. For the DMOZ-50 dataset, the highest average predictive performance (88.1%) is obtained with the combination of consistency-based feature selection with the AdaBoost and naive Bayes algorithms, which is a promising result for web page classification. The results also indicate that the Bagging and Random Subspace ensemble methods and the correlation-based and consistency-based feature selection methods obtain better accuracy rates.
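One of the four compared criteria, chi-square-based feature selection, scores each term by the chi-square statistic of its 2x2 term/class contingency table; terms whose occurrence is independent of the class score 0 and are dropped first. A minimal sketch, with hypothetical document counts:

```python
def chi_square(n_11, n_10, n_01, n_00):
    """Chi-square statistic for a term/class 2x2 contingency table:
    n_11 = docs in the class containing the term, n_10 = docs in the class without it,
    n_01 = other docs containing the term,        n_00 = other docs without it."""
    n = n_11 + n_10 + n_01 + n_00
    num = n * (n_11 * n_00 - n_10 * n_01) ** 2
    den = (n_11 + n_01) * (n_10 + n_00) * (n_11 + n_10) * (n_01 + n_00)
    return num / den if den else 0.0

# a term in 8 of 10 class docs but only 1 of 10 other docs: strongly class-bound
print(round(chi_square(8, 2, 1, 9), 2))   # 9.9
# a term spread evenly across classes carries no class information
print(chi_square(5, 5, 5, 5))             # 0.0
```

Feature selection then keeps the k highest-scoring terms per class (or by maximum score over classes) before the base learners are trained.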
Conference Paper
Full-text available
Preprocessing is an important task and a critical step in information retrieval and text mining. The objective of this study is to analyze the effect of preprocessing methods on the classification of Turkish texts. We compiled two large datasets from Turkish newspapers using a crawler. On these compiled datasets, together with two additional datasets, we perform a detailed analysis of preprocessing methods such as stemming, stopword filtering and word weighting for Turkish text classification, and we report the results of extensive experiments.
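The word weighting step studied here is commonly realized as TF-IDF: a term's weight in a document is its frequency there, discounted by how many documents contain it. A minimal sketch with a hypothetical Turkish mini-corpus:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Weight each term by its in-document frequency times log inverse document frequency."""
    n = len(corpus)
    tokenized = [doc.split() for doc in corpus]
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

corpus = ["ekonomi piyasa piyasa", "spor maç", "ekonomi borsa"]
w = tf_idf(corpus)
# "piyasa" is frequent in doc 0 and rare elsewhere, so it outweighs "ekonomi"
print(w[0]["piyasa"] > w[0]["ekonomi"])  # True
```

Stemming and stopword filtering would be applied to the tokens before weighting, which is exactly why the study evaluates the methods in combination rather than in isolation.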
Article
Full-text available
Text categorization or classification (TC) is concerned with placing text documents in their proper category according to their contents. Owing to the various applications of TC and the large volume of text documents uploaded to the Internet daily, the need for such an automated method stems from the difficulty and tedium of performing such a process manually. The usefulness of TC is manifested in different fields and needs. For instance, the ability to automatically classify an article or an email into its right class (Arts, Economics, Politics, Sports, etc.) would be appreciated by individual users as well as companies. This paper is concerned with the TC of Arabic articles. It contains a comparison of the five best-known algorithms for TC. It also studies the effects of utilizing different Arabic stemmers (light and root-based stemmers) on the effectiveness of these classifiers. Furthermore, a comparison between different data mining software tools (Weka and RapidMiner) is presented. The results illustrate the good accuracy provided by the SVM classifier, especially when used with the light10 stemmer. This outcome can be used in the future as a baseline for comparison with other unexplored classifiers and Arabic stemmers.
Article
As an important issue in sentiment analysis, sentence-level polarity classification plays a critical role in many opinion-mining applications such as opinion question answering, opinion retrieval and opinion summarization. Employing a supervised learning paradigm to train a classifier on sentences often faces the data sparseness problem owing to the short length of such texts. In this article, to address this problem, we exploit two different feature sets learned from external datasets as additional features to enrich the data representation: one is a latent topic feature set obtained using a topic model, and the other is a related-word feature set derived using word embeddings. Furthermore, we propose an ensemble approach that uses these additional features to guide the design of the different members of the ensemble. Experimental results on the public movie review dataset demonstrate that the enriched representations are effective for improving the performance of polarity classification, and that the proposed ensemble approach can further improve overall performance.
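The enrichment step amounts to concatenating the sentence's base features with the two externally learned feature sets, so a short sentence is represented by more than its few surface terms. A minimal sketch; all vectors below are hypothetical toy values, not features from the paper's models.

```python
def enrich(bow_vec, topic_vec, embedding_vec):
    """Concatenate base bag-of-words features with two externally learned
    feature sets to form the enriched sentence representation."""
    return list(bow_vec) + list(topic_vec) + list(embedding_vec)

sentence_bow = [1, 0, 2]           # sparse term counts for a short sentence
latent_topics = [0.7, 0.1]         # hypothetical topic-model posterior
related_words = [0.3, -0.2, 0.5]   # hypothetical embedding-derived features
print(enrich(sentence_bow, latent_topics, related_words))
# [1, 0, 2, 0.7, 0.1, 0.3, -0.2, 0.5]
```

The ensemble design then varies which of these feature sets each member classifier sees, so the members make usefully different errors.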
Article
This paper proposes an improved random forest algorithm for classifying text data. The algorithm is particularly designed for analyzing very high dimensional, multi-class data, of which text corpora are a well-known representative. A novel feature weighting method and a tree selection method are developed and combined to make the random forest framework well suited to categorizing text documents spanning dozens of topics. With the new feature weighting method for subspace sampling and the tree selection method, we can effectively reduce the subspace size and improve classification performance without increasing the error bound. We apply the proposed method to six text datasets with diverse characteristics. The results demonstrate that the improved random forest outperforms popular text classification methods in terms of classification performance.
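The core idea of weighted subspace sampling can be sketched as drawing a tree's feature subspace without replacement, with probability proportional to an informativeness weight, so that a small subspace still tends to contain useful terms. The feature names and weights below are hypothetical, and this is not the paper's exact weighting scheme.

```python
import random

def weighted_subspace(features, weights, size, seed=7):
    """Sample a feature subspace without replacement, biased toward
    informative features, for growing one tree of the forest."""
    rng = random.Random(seed)
    pool = list(zip(features, weights))
    chosen = []
    while pool and len(chosen) < size:
        total = sum(w for _, w in pool)
        r = rng.uniform(0, total)       # roulette-wheel draw over remaining weights
        acc = 0.0
        for i, (feat, w) in enumerate(pool):
            acc += w
            if acc >= r:
                chosen.append(feat)
                pool.pop(i)             # without replacement
                break
    return chosen

features = ["goal", "match", "the", "election", "vote"]
weights  = [0.9,    0.8,     0.01,  0.7,        0.6]   # hypothetical informativeness
print(weighted_subspace(features, weights, 3))
```

Plain random forests sample the subspace uniformly; weighting the draw is what lets the subspace size shrink without starving trees of discriminative terms.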
Article
Predicting web user behaviour is typically an application of finding frequent sequence patterns. With the rapid growth of the Internet, a large amount of information is stored in web logs, and traditional frequent-sequence-pattern-mining algorithms are hard pressed to analyse such big datasets. In this paper, we propose an efficient way to predict the navigation patterns of web users by improving frequent-sequence-pattern-mining algorithms based on the MapReduce programming model, which can handle huge datasets efficiently. Our experiments show that the proposed MapReduce-based algorithm is more efficient than traditional frequent-sequence-pattern-mining algorithms, and by comparing it with existing algorithms in web-usage mining, we also show that using the MapReduce programming model saves time.
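The MapReduce formulation can be illustrated for the simplest sequence length, consecutive page pairs: mappers emit a (pair, 1) record per click transition in a session, and reducers sum the counts per key. The session data is illustrative, and a real deployment would shard both phases across a cluster.

```python
from collections import defaultdict

def map_phase(session):
    """Emit (page-pair, 1) for each consecutive click in one user session."""
    return [((a, b), 1) for a, b in zip(session, session[1:])]

def reduce_phase(pairs):
    """Sum the counts for each key, as the reducers would in parallel."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

sessions = [["home", "news", "sports"], ["home", "news", "tech"]]
mapped = [kv for s in sessions for kv in map_phase(s)]
print(reduce_phase(mapped)[("home", "news")])  # 2
```

Longer frequent sequences are mined the same way by emitting candidate subsequences in the map phase and pruning those whose reduced count falls below the support threshold.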