I.J. Intelligent Systems and Applications, 2012, 9, 61-67
Published Online August 2012 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijisa.2012.09.08
Copyright © 2012 MECS I.J. Intelligent Systems and Applications, 2012, 9, 61-67
The Impact of Feature Selection on Web Spam
Detection
Jaber Karimpour
Dept. of Computer Science, University of Tabriz, Tabriz, Iran
karimpour@tabrizu.ac.ir
Ali A. Noroozi
Dept. of Computer Science, University of Tabriz, Tabriz, Iran
aliasghar.noroozi@gmail.com
Adeleh Abadi
Dept. of Computer Science, University of Tabriz, Tabriz, Iran
adeleh.abadi@gmail.com
Abstract— The search engine is one of the most important tools for managing the massive amount of distributed web content. Web spamming tries to deceive search engines into ranking some pages higher than they deserve. Many methods have been proposed to combat web spamming and to detect spam pages. One basic approach is classification, i.e., learning a classification model for classifying web pages as spam or non-spam. This work tries to select the best feature set for web spam classification using the imperialist competitive algorithm and the genetic algorithm. The imperialist competitive algorithm is a novel optimization algorithm inspired by the socio-political process of imperialism in the real world. Experiments are carried out on the WEBSPAM-UK2007 data set; they show that feature selection improves classification accuracy, and that the imperialist competitive algorithm outperforms the genetic algorithm.
Index Terms— Web Spam Detection, Feature Selection, Imperialist Competitive Algorithm, Genetic Algorithm
I. Introduction
With the explosive growth of information, the web has become the most successful and largest distributed computing application today. Billions of web pages are shared by millions of organizations, universities, researchers, etc. Web search provides great functionality for distributing, sharing, organizing, and retrieving this growing amount of information [1]. Search engines have become more and more important and are used by millions of people to find the information they need. It has become very important for a web page to be ranked high in the results of the major search engines. As a result, many techniques have been proposed to influence ranking and improve the rank of a page. Some of these techniques are legal and are called Search Engine Optimization (SEO) techniques, but others are neither legal nor ethical and try to deceive ranking algorithms. These spamming techniques try to rank pages higher than they deserve [2].
Web spam refers to web content that gets a high rank in search engine results despite having low information value. Spamming not only misleads users, but also imposes time and space costs on search engine crawlers and indexers. That is why crawlers try to detect web spam pages to avoid processing and indexing them.
Content-based spamming methods basically tailor the
contents of the text fields in HTML pages to make spam
pages more relevant to some queries. This kind of
spamming is also called term spamming. There are two
main content spamming techniques, which simply
create synthetic contents containing spam terms:
repeating some important terms and dumping many
unrelated terms [3,4].
Link spamming misuses the link structure of the web to promote spam pages. There are two main kinds of link spamming. Out-link spamming tries to boost the hub score of a page by adding out-links pointing to authoritative pages. One common technique of this kind is directory cloning, i.e., replicating a large portion of a directory such as Yahoo! in the spam page. In-link spamming refers to persuading other pages, especially authoritative ones, to point to the spam page. To do this, a spammer might adopt these strategies: creating a honey pot, infiltrating a web directory, posting links on user-generated content, participating in link exchange, buying expired domains, and creating his own spam farm [2].
Hiding techniques are also used by spammers who want to conceal the spamming sentences, terms, and links so that web users do not see them [3]. Content hiding is used to make spam items invisible; one simple method is to make the spam terms the same color as the page background. In cloaking, spam
web servers return one HTML document to the user and a different document to a web crawler. In redirecting, a spammer hides the spammed page by automatically redirecting the browser to another URL as soon as the page is loaded. In the latter two techniques, the spammer can present the user with the intended content and the search engine with spam content [5].
Various methods have been proposed to combat web spamming and to detect spam pages. One important and basic approach is to treat web spam detection as a binary classification problem [4]. In such methods, some web pages are collected as training data and labeled as spam or non-spam by an expert. Then, a classifier model is learned from the training data; any supervised learning algorithm can be used to build this model. The model is then used to classify any web page as spam or non-spam. The key issue is designing the features used in learning. Ntoulas et al. [4] propose some content-based features to detect content spam. Link-based features have been proposed for link spam detection [6,7]. Liu et al. [8] propose user behavior features extracted from the access logs of a page's web server. These features depict user behavior patterns when reaching a page (spam or non-spam); these patterns are used to separate spam pages from non-spam ones, regardless of the spamming techniques used. Erdelyi et al. [9] investigate the tradeoff between feature generation and spam classification accuracy. They conclude that more features achieve better performance; however, the appropriate choice of machine learning technique for classification is probably more important than devising new complex features.
Feature selection is the process of finding an optimal
subset of features that contribute significantly to the
classification. Selecting a small subset of features can
decrease the cost and the running time of a
classification system. It may also increase the
classification accuracy because irrelevant or redundant
features are removed [10]. Among the many methods
proposed for feature selection, evolutionary
optimization algorithms such as genetic algorithm (GA)
have gained a lot of attention. Genetic algorithm has
been used as an efficient feature selection method in
many applications [11,16].
In this paper, we apply the genetic algorithm and the imperialist competitive algorithm [12] to find an optimal subset of the features of the WEBSPAM-UK2007 data set [13,14]. The selected features are then used for classification of the WEBSPAM-UK2007 data.
The rest of the paper is organized as follows. Section
2 gives a brief introduction of the imperialist
competitive algorithm (ICA). Section 3 describes the
feature selection process by ICA and GA. Experimental
results are discussed in section 4, and finally, section 5
concludes the paper.
II. Imperialist Competitive Algorithm
The imperialist competitive algorithm is inspired by
imperialism in the real world [12]. Imperialism is the
policy of extending the power of a country beyond its
boundaries and weakening other countries to take
control of them.
Fig. 1 Initialization of the empires: The more colonies an imperialist
possesses, the bigger is its mark [12]
This algorithm starts with an initial society of randomly generated countries. Some of the best countries are selected to be imperialists, and the others become colonies of these imperialists. The power of an empire, which is the counterpart of the fitness value in genetic algorithms, is the power of the imperialist country plus a percentage of the mean power of its colonies. Figure 1 depicts the initialization of the empires.
After assigning all countries to imperialists and forming empires, colonies start moving towards their relevant imperialist (assimilation). Then, some countries randomly change position in the search space (revolution). After assimilation and revolution, a colony may reach a better position in the search space and take control of the empire (substituting for the imperialist).
Fig. 2 Imperialistic competition: The weakest colony of the weakest
empire is possessed by other empires [12]
Then, imperialistic competition begins. All empires
try to take control of the weakest colony of the weakest
empire. This competition reduces the power of weaker
empires and increases the power of the powerful ones.
Any empire that cannot compete with the other empires and increase its power, or at least prevent it from decreasing, will gradually collapse. As a result, after some iterations, the algorithm converges: only one imperialist remains, and all other countries are its colonies. Figure 2 depicts the imperialistic competition. The more powerful an empire is, the more likely it is to take control of the weakest colony of the weakest empire.
The pseudo code of ICA is as follows:
1. Initialize the empires.
2. Assimilation: move the colonies toward their relevant imperialist.
3. Revolution: randomly change the characteristics of some colonies.
4. Exchange positions: if a colony has more power than its imperialist, exchange the positions of that colony and the imperialist.
5. Compute the total power of all empires.
6. Imperialistic competition: give the weakest colony of the weakest empire to the empire most likely to possess it.
7. Eliminate the powerless empires.
8. If only one empire remains, stop; otherwise, go to step 2.
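As a minimal illustration, the loop above can be sketched in Python. This is our own sketch, not the authors' implementation: the power function is a toy stand-in for the classifier F-score, colonies are assigned round-robin rather than proportionally, and the strongest empire always wins the competition instead of winning probabilistically.

```python
import random

def ica(n_countries=20, n_imp=4, n_features=8, xi=0.1,
        rev_rate=0.1, max_iter=50, seed=0):
    """Minimal sketch of the ICA pseudo code on binary feature masks.
    power() is a toy stand-in for the F-score of a trained classifier."""
    rng = random.Random(seed)

    def power(country):
        # Toy power function: fraction of selected features.
        return sum(country) / len(country)

    def total_power(emp):
        # Imperialist power plus xi times the mean colony power.
        mean_col = (sum(map(power, emp["cols"])) / len(emp["cols"])
                    if emp["cols"] else 0.0)
        return power(emp["imp"]) + xi * mean_col

    # Step 1: the best countries become imperialists, the rest colonies
    countries = [[rng.randint(0, 1) for _ in range(n_features)]
                 for _ in range(n_countries)]
    countries.sort(key=power, reverse=True)
    empires = [{"imp": c, "cols": []} for c in countries[:n_imp]]
    for i, col in enumerate(countries[n_imp:]):   # simplified: round-robin
        empires[i % n_imp]["cols"].append(col)

    for _ in range(max_iter):
        for emp in empires:
            for col in emp["cols"]:
                # Step 2, assimilation: copy random imperialist cells
                for j in range(n_features):
                    if rng.random() < 0.5:
                        col[j] = emp["imp"][j]
                # Step 3, revolution: flip some cells at random
                for j in range(n_features):
                    if rng.random() < rev_rate:
                        col[j] = 1 - col[j]
            # Step 4: a colony that outpowers its imperialist takes over
            if emp["cols"]:
                k = max(range(len(emp["cols"])),
                        key=lambda i: power(emp["cols"][i]))
                if power(emp["cols"][k]) > power(emp["imp"]):
                    emp["imp"], emp["cols"][k] = emp["cols"][k], emp["imp"]
        # Steps 5-6: the weakest colony of the weakest empire changes hands
        weakest = min(empires, key=total_power)
        if weakest["cols"]:
            col = min(weakest["cols"], key=power)
            weakest["cols"].remove(col)
            # Simplified: the strongest empire always wins the competition
            max(empires, key=total_power)["cols"].append(col)
        elif len(empires) > 1:
            # Step 7: a colony-less empire collapses; its imperialist
            # becomes a colony of the strongest empire
            empires.remove(weakest)
            max(empires, key=total_power)["cols"].append(weakest["imp"])
        # Step 8: stop when a single empire remains
        if len(empires) == 1:
            break
    return max(empires, key=total_power)["imp"]

best = ica()
print(best)
```

With the toy power function the search converges toward masks with many selected features; in the paper, power is instead the weighted F-score of a classifier trained on the selected feature subset.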
III. Feature Selection
The WEBSPAM-UK2007 data set contains 96 content-based features. We use the imperialist competitive and genetic algorithms to find an optimal subset of features that contribute significantly to the classification.
A. Feature Selection Using ICA
In this section, the steps of feature selection using
ICA are described.
1) Initialize the empires
In the genetic algorithm, each solution to an optimization problem is an array, called a chromosome. In ICA, this array is called a country. In feature selection, each country is an array of binary numbers: when country[i] is 1, the ith feature is selected for classification, and when it is 0, the ith feature is removed [15]. Figure 3 depicts the feature representation as a country.
Fig. 3 Feature representation as a country in ICA [15]
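Concretely, the mapping from such a binary country to a feature subset can be written as follows (an illustrative sketch; the feature names are hypothetical):

```python
def feature_subset(country, features):
    """Map a binary country/chromosome to the selected feature names."""
    return [f for f, bit in zip(features, country) if bit == 1]

# Hypothetical feature names; bit i selects feature i
features = ["F1", "F2", "F3", "F4"]
print(feature_subset([1, 0, 1, 1], features))  # ['F1', 'F3', 'F4']
```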
The power of each country is calculated by the F-score, a commonly used measure in machine learning and information retrieval [3,10]. The confusion matrix of a given classifier is shown in Table 1.

Table 1. Confusion Matrix

                      Classified spam    Classified non-spam
Actual spam           A                  B
Actual non-spam       C                  D

The F-score is the harmonic mean of recall and precision:

    F-score = 2 / (1 / Recall + 1 / Precision)    (1)

where Recall and Precision are defined as

    Recall = A / (A + B)                          (2)
    Precision = A / (A + C)                       (3)
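Using the labels of Table 1 (A, B = spam classified as spam/non-spam; C, D = non-spam classified as spam/non-spam) and the standard definitions of recall and precision, the power function can be computed as follows (our sketch, not the authors' code; the counts are hypothetical):

```python
def f_score(a, b, c, d):
    """F-score from the confusion matrix of Table 1:
    a = spam classified as spam         (true positives)
    b = spam classified as non-spam     (false negatives)
    c = non-spam classified as spam     (false positives)
    d = non-spam classified as non-spam (true negatives)"""
    recall = a / (a + b)       # standard recall
    precision = a / (a + c)    # standard precision
    return 2 / (1 / recall + 1 / precision)  # harmonic mean

# Hypothetical counts: 150 of 208 spam hosts found, 30 false alarms
print(round(f_score(150, 58, 30, 3611), 3))  # 0.773
```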
The algorithm starts by randomly initializing a population of size N_pop. The N_imp most powerful countries are selected as imperialists and form the empires. The remaining N_col countries are assigned to the empires based on the power of each empire. The normalized power of each imperialist is defined by

    NP_n = P_n / (sum of P_i for i = 1, ..., N_imp)    (4)

where P_n is the power of country n. The initial number of colonies of empire n will be

    NC_n = round(NP_n * N_col)                         (5)

To assign colonies to empires, NC_n of the colonies are chosen randomly and assigned to imperialist n. These colonies, along with imperialist n, form empire n.
2) Assimilation
In this phase, colonies move towards their relevant imperialist. Since feature selection is a discrete problem, we use the following operator for assimilation [15]:

For each colony:
- Create a binary string, assigning a randomly generated bit to each cell
- Copy the cells of the relevant imperialist corresponding to the locations of the "1"s in the binary string to the same positions in the colony
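This discrete assimilation operator can be sketched as:

```python
import random

def assimilate(colony, imperialist, seed=None):
    """Copy imperialist cells into the colony at the positions where a
    randomly generated binary mask holds a 1."""
    rng = random.Random(seed)
    mask = [rng.randint(0, 1) for _ in colony]
    return [imp if m == 1 else col
            for col, imp, m in zip(colony, imperialist, mask)]

print(assimilate([0, 0, 0, 0, 0], [1, 1, 1, 1, 1], seed=7))
```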
3) Revolution
The purpose of revolution is to preserve and introduce diversity. It allows the algorithm to avoid local minima. Revolution occurs according to a user-defined revolution probability: for each colony, some cells are selected randomly and their bits are inverted ("1" becomes "0", and "0" becomes "1").
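The revolution step is a set of independent bit flips, e.g.:

```python
import random

def revolve(country, p_rev=0.01, seed=None):
    """Invert each cell of a country with probability p_rev."""
    rng = random.Random(seed)
    return [1 - bit if rng.random() < p_rev else bit for bit in country]

print(revolve([1, 0, 1, 1, 0], p_rev=0.5, seed=1))
```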
4) Exchange the positions of a colony and imperialist
After assimilation and revolution, a colony may gain more power than that of its imperialist. In that case, the best colony of the empire and its imperialist exchange positions, and the algorithm continues with the imperialist in its new position.
5) Compute the total power of empires
The total power of an empire is mainly determined by the power of its imperialist; the power of its colonies has a smaller impact. Accordingly, the total power of empire n is defined as

    TP_n = power(imperialist_n) + ξ * mean{power(colonies of empire_n)}    (6)

where ξ is a positive factor considered to be less than 1. Decreasing ξ increases the role of the imperialist in determining the total power of an empire, and increasing it increases the role of the colonies.
6) Imperialistic Competition
In this important phase of the algorithm, the empires compete to take control of the weakest colony of the weakest empire. Each empire has a likelihood of possessing this colony. The possession probability of empire n is given by

    P_emp,n = TP_n / (sum of TP_i for i = 1, ..., N_imp)    (7)

Note that the most powerful empire does not necessarily take possession of the weakest colony of the weakest empire; it is only the most likely to possess it.
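This stochastic possession can be sketched as a roulette-wheel draw over the probabilities of Eq. (7) (our simplification; the original ICA paper [12] subtracts a uniform random vector from the probabilities and takes the argmax, which serves the same purpose):

```python
import random

def possess(total_powers, seed=None):
    """Choose the empire that possesses the weakest colony: empire n
    wins with probability TP_n / sum(TP_i)."""
    rng = random.Random(seed)
    r = rng.random() * sum(total_powers)
    acc = 0.0
    for n, tp in enumerate(total_powers):
        acc += tp
        if r <= acc:
            return n
    return len(total_powers) - 1

# The strongest empire wins most often, but not always
wins = [0, 0, 0]
for s in range(1000):
    wins[possess([0.5, 0.3, 0.2], seed=s)] += 1
print(wins)
```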
7) Eliminate the powerless empires
Imperialistic competition causes some empires to
lose power and gradually collapse. When an empire
loses all its colonies, we assume it is collapsed and
eliminate it. The imperialist of this powerless empire is
possessed by other empires as a colony.
8) Convergence
As a result of imperialistic competition and
elimination of powerless empires, the algorithm will
converge to the most powerful empire and all the
countries will be under the control of this empire. The
imperialist of this empire will determine the optimal
subset of features selected for classification, because
this imperialist is the most powerful of all countries.
B. Feature selection using GA
In the genetic algorithm, each solution to the feature selection problem is a string of binary numbers, called a chromosome. When chromosome[i] is 1, the ith feature is selected for classification, and when it is 0, the ith feature is not selected [11,16].
The fitness function reflects the accuracy of the classification model. In this research, we calculate the fitness value of each chromosome by the F-score described in the previous section.
The algorithm starts by randomly initializing a population of size N_pop. Then, crossover and mutation are performed.
Crossover generates new chromosomes by combining the current best chromosomes. We use single-point crossover: one crossover point is selected, the binary string from the beginning of the chromosome to the crossover point is copied from one parent, and the rest is copied from the second parent. Figure 4 shows how children are generated from each pair of chromosomes by crossover.
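Single-point crossover can be sketched as:

```python
import random

def crossover(parent1, parent2, seed=None):
    """Single-point crossover: cut both parents at one random point
    and swap the tails to produce two children."""
    rng = random.Random(seed)
    point = rng.randint(1, len(parent1) - 1)
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

c1, c2 = crossover([1, 1, 1, 1], [0, 0, 0, 0], seed=3)
print(c1, c2)
```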
Mutation is similar to revolution in ICA. It maintains genetic diversity and allows the algorithm to avoid local minima. To perform mutation, a random cell is selected in each chromosome and its bit is inverted ("1" becomes "0", and "0" becomes "1"). Mutation and crossover occur according to previously defined mutation and crossover probabilities. The genetic algorithm iterates for a user-defined number of generations.
Fig. 4 How children are generated from parents by crossover [17]
IV. Experimental Results
In order to investigate the impact of feature selection
on web spam classification, WEBSPAM-UK2007 data
are used. It is a publicly available web spam data
collection and is based on a crawl of the .uk domain
done in May 2007 [13, 14]. It includes 105 million
pages and over 3 billion links in 114529 hosts.
The training set contains 3849 hosts. The data set provides both content-based and link-based features; in our experiments, we used only the content-based features because they were sufficient for our purposes. The selected data set contains 3849 instances, with 208 spam and 3641 non-spam hosts. We partitioned this data set into two disjoint sets: a training set with 2449 instances and a test set with 1000 instances. After performing feature
selection using the training set, the test set was used to
evaluate the selected subset of features.
The evaluation of the overall process was based on the weighted F-score, which is a suitable measure for the spam classification problem. It was also used as the power function in ICA and the fitness function in GA.
Bayesian Network, Decision Tree (the C4.5 algorithm), and Support Vector Machine (SVM) were chosen as learning algorithms to perform the classification and calculate the weighted F-score. These are powerful learning algorithms used in many web spam detection studies [4, 5, 18].
The following parameters were used for ICA:
Number of countries = 100
Number of imperialists = 10
ξ = 0.1
Revolution rate = 0.01
The selected parameters for GA are as follows:
Initial population = 100
Number of generations (iterations) = 100
Crossover rate = 0.6
Mutation rate = 0.01
Figure 5 depicts the maximum and mean power of all imperialists versus iteration in ICA, using the Decision Tree, SVM, and Bayesian Network classifiers. As shown in this figure, with the SVM and Decision Tree classifiers, the global maximum of the function (maximum power) is found in fewer than 5 iterations, while with Bayesian Network it is found in the 12th iteration.
Fig. 5 Mean and maximum power of all imperialists versus iteration, using different classifiers, in ICA
Figures 6 and 7 compare the ICA power function and the GA fitness function versus iteration. Figure 6 shows the power (fitness) of the best answer versus iteration (generation) in ICA and GA, using the Bayesian Network classifier. As can be seen, ICA converges faster than GA and has more power than GA in all iterations. Another important point is that the initial value of the F-score, which results from the random initialization of the population in both algorithms, increases more under ICA over the iterations. This shows that imperialistic competition outperforms genetic evolution in the problem of spam classification.
Fig. 6 Power (fitness) of best answer versus iteration, using Bayesian Network classifier
Fig. 7 Mean power (fitness) of all answers versus iteration, using Bayesian Network classifier
Figure 7 depicts the mean power (fitness) of all answers versus iteration in ICA and GA, using the Bayesian Network classifier. As can be seen, ICA achieves a greater increase in the mean power of all answers.
The optimal subsets of features selected by ICA and GA are used to train a classification model, which is evaluated on the test data set. The evaluation results obtained for the Bayesian Network, Decision Tree, and SVM classifiers are shown in Table 2. These results indicate that feature selection by both ICA and GA improves web spam classification. Furthermore, ICA-based feature selection outperforms GA-based feature selection in the problem of web spam detection.
Table 2. The impact of ICA- and GA-based feature selection on web spam classification, using different classifiers

                 Bayesian Network       Decision Tree          SVM
                 #Features  F-score     #Features  F-score     #Features  F-score
All features     96         0.854       96         0.935       96         0.937
GA               48         0.876       49         0.950       56         0.939
ICA              41         0.882       56         0.950       61         0.940
V. Conclusion
In this paper, we studied the impact of feature selection on the problem of web spam classification. Feature selection was performed by the imperialist competitive algorithm and the genetic algorithm. Experimental results showed that selecting an optimal subset of features increases classification accuracy, and that ICA finds better solutions than GA. In fact, we observed that reducing the number of features decreases the classification cost and increases the classification accuracy.
Other optimization methods, such as particle swarm optimization (PSO) and ant colony optimization, can be applied to feature selection and compared with ICA and GA in future work.
References
[1] Caverlee J, Liu L, Webb S. A Parameterized
Approach to Spam-Resilient Link Analysis of the
Web. IEEE Transactions on Parallel and Distributed
Systems (TPDS), 2009, 20:1422-1438.
[2] Gyongyi Z, Garcia-Molina H. Web spam taxonomy. In: First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'05), Japan, 2005.
[3] Liu B. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data. Springer, 2006.
[4] Ntoulas A, Najork M, Manasse M, et al. Detecting
Spam Web Pages through Content Analysis. In
Proc. of the 15th Intl. World Wide Web Conference
(WWW’06), 2006. 83–92
[5] Wang W, Zeng G, Tang D. Using evidence based
content trust model for spam detection. Expert
Systems with Applications, 2010. 37(8):5599-5606
[6] Becchetti L, Castillo C, Donato D, et al. Link-based
characterization and detection of Web Spam. In
Proc. Of 2nd Int. Workshop on Adversarial
Information Retrieval on the Web (AIRWeb’06),
Seattle, WA, 2006. 18
[7] Castillo C, Donato D, Gionis A, et al. Know your
neighbors: Web spam detection using the web
topology. In Proc. Of 30th Annu. Int. ACM SIGIR
Conf. Research and Development in Information
Retrieval (SIGIR’07), New York, 2007. 423–430
[8] Liu Y, Cen R, Zhang M, et al. Identifying web
spam with user behavior analysis. In Proc. Of 4th
Int. Workshop on Adversarial Information
Retrieval on the Web (AIRWeb’08), China, 2008.
9-16
[9] Erdelyi M, Garzo A, Benczur A A. Web spam classification: a few features worth more. In Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality 2011), India, 2011. 27-34.
[10] Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 3rd edn, Morgan Kaufmann, 2011.
[11] Vafaie H, De Jong K. Genetic algorithms as a tool
for feature selection in machine learning. In
Proceedings of Fourth International Conference on
Tools with Artificial Intelligence (TAI '92), 1992.
200-203.
[12] Atashpaz-Gargari E, Lucas C. Imperialist
competitive algorithm: An algorithm for
optimization inspired by imperialistic competition.
IEEE Congress on Evolutionary Computation
(CEC 2007), 2007. 4661-4667
[13] Castillo C, Donato D, Becchetti L, et al. A reference collection for web spam. SIGIR Forum, 2006, 40(2): 11-24.
[14] Yahoo Research. Web Spam Collections. [cited 2011 May]. Available from: http://barcelona.research.yahoo.net/webspam/datasets/, 2007.
[15] Mousavi Rad S J, Mollazade K, Akhlagian Tab F.
Application of Imperialist Competitive Algorithm
for Feature Selection: A Case Study on Bulk Rice
Classification. International Journal of Computer
Applications, 2012. 40(16):41-48
[16] Yang J, Honavar V. Feature subset selection using
a genetic algorithm. Intelligent Systems and their
Applications, IEEE, 1998. 13(2):44-49.
[17] Eiben A E, Smith J E. Introduction to Evolutionary
Computing, Springer, 2010.
[18] Araujo L, Martinez-Romo J. Web Spam Detection:
New Classification Features Based on Qualified
Link Analysis and Language Models. IEEE
Transactions on Information Forensics and Security,
2010. 5(3):581-590.
KARIMPOUR Jaber (1974), male, Tabriz, Iran,
Assistant Professor, his research directions include
verification and formal methods.
NOROOZI Ali A. (1986), male, Tabriz, Iran, Master
of Science, his research directions include adversarial
information retrieval and distributed systems.
ABADI Adeleh (1976), female, Tabriz, Iran, Master of Science; her research directions include verification and formal methods.
... 4. he number of phishing sites spooing social networking sites increased 125%. 5. Web attacks blocked in average per day in 2011 is 190,370 and in 2012 it increases to 247,350. ...
... hey used WEBSPAM-UK2007 dataset. Feature selection hikes the classiication accuracy [5]. Geng et al. focused on re-extracted features for spam classiication [6]. ...
... hese parameters are adjusted by optimizing performance on a subset (called a validation set) of the training set and cross-validation. 5. Evaluate the accuracy of the learned function. ...
Article
World Wide Web (WWW) is a huge, dynamic, self-organized, and strongly interlinked source of information. Search engine became a vital IR (Information Retrieval) system to retrieve the required information. Results appearing in the first few pages gain more attraction and importance. Since users believe that they were more relevant because of its top positions. Spamdexing plays a key role in making high rank and top visibility for an undeserved page. This paper focus on two aspects: new features and new classifiers. First, 27 new features which are used to commercially boost the ranking and reputation are considered for classification. Along with them 17 new features were proposed and computed. Totally 44 features were combined with the existing WEBSPAM-UK 2007 dataset which is the baseline. With all these features, feature inclusion study is carried out to elevate the performance. Second aspect considered in this paper is exploring new suite of five different machine learners for the web spam classification problem. Results are discussed. New feature inclusion improves the classification accuracy of the publicly available WEBSPAM-UK 2007 features by 22%. SVM outperforms well than the other methods in terms of accuracy.
... Similar to spam in the Web (Karimpour et al. 2012) and e-mail (Zhang et al. 2012;Tseng et al. 2011), spammers have been targeting online social networks (OSNs) such as Facebook, YouTube, and Twitter (Gao et al. 2010;Grier 2010;Benevenuto et al. 2009). There have been a large number of efforts toward combating spam in various online systems including OSNs; refer to Heymann et al. (2007) and Caruana and Li (2012) for surveys on such efforts. ...
... Attribute selection algorithms have also been used to select important attributes for spam detection, especially in the domains of Web spam and e-mail spam. For instance, Karimpour et al. (2012) demonstrated the benefits of attribute selection for Web spam detection. Zhang et al. (2012) used a wrapper-based particle swarm optimization technique to select attributes for an e-mail spam classification system. ...
Article
Full-text available
As online social network (OSN) sites become increasingly popular, they are targeted by spammers who post malicious content on the sites. Hence, it is important to filter out spam accounts and spam posts from OSNs. There exist several prior works on spam classification on OSNs, which utilize various features to distinguish between spam and legitimate entities. The objective of this study is to improve such spam classification, by developing an attribute selection methodology that helps to find a smaller subset of the attributes which leads to better classification. Specifically, we apply the concepts of rough set theory to develop the attribute selection algorithm. We perform experiments over five different spam classification datasets over diverse OSNs and compare the performance of the proposed methodology with that of several baseline methodologies for attribute selection. We find that, for most of the datasets, the proposed methodology selects an attribute subset that is smaller than what is selected by the baseline methodologies, yet achieves better classification performance compared to the other methods.
... The authors of [28] investigated the impact of feature selection on the problem of web spam detection. This study proposed a new feature selection method called Imperialist Competitive Algorithm and implemented it with Genetic algorithm. ...
Article
Full-text available
Text classification is a vital process due to the large volume of electronic articles. One of the drawbacks of text classification is the high dimensionality of feature space. Scholars developed several algorithms to choose relevant features from article text such as Chi-square (x 2), Information Gain (IG), and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known algorithms: Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree against benchmark Arabic textual datasets, called Saudi Press Agency (SPA) to evaluate the impact of feature selection methods. Using the WEKA tool, we have experimented the application of the four mentioned classification algorithms with and without feature selection algorithms. The results provided clear evidence that the three feature selection methods often improves classification accuracy by eliminating irrelevant features.
... Kariampor et al. [9] performs classification of web spam using imperialist competitive algorithm and genetic algorithm. Imperialist competitive algorithm is a novel optimization algorithm that is inspired by socio-political process of imperialism in the real world. ...
Article
Full-text available
Spamdexing is the art of black hat SEO. Features which are more influential for high rank and visibility are manipulated for the SEO task. The motivation behind the work is utilizing the state of the art Website optimization features to enhance the performance of spamdexing detection. Features which play a focal role in current SEO strategies show a significant deviation for spam and non spam samples. This paper proposes 44 features named as NLSDF (New Link Spamdexing Detection Features). Social media creates an impact in search engine results ranking. Features pertaining to the social media were incorporated with the NLSDF features to boost the recital of the spamdexing classification. The NLSDF features with 44 attributes along with 5 social media features boost the classification performance of the WEBSPAM-UK 2007 dataset. The one tailed paired t-test with 95% confidence, performed on the AUC values of the learning models shows significance of the NLSDF.
... This theory applies to every walk of our life, and the same is true for writing, synthesis, and identification in a document [22,26] of a word processing application, or searching a query in a web browser. Today, word processing applications, text editors, web browsers, spam detection [21], etc., on our machines satisfy human typing needs quite efficiently, and the facility of character/word/sentence prediction and recognition [23] has become a tool to reduce the time of human typing on machines. There is large prediction support on machines for English and also for other European languages. ...
Article
Full-text available
This work lays down a foundation for text prediction of Urdu, an inflected and under-resourced language. The interface developed is not limited to a T9 (Text on 9 keys) application used in embedded devices, which can only predict a word after its initial characters are typed. It is capable of predicting a word like T9 and also a sequence of words, one after another, in a continuous manner for fast document typing. It is based on an N-gram language model. This stochastic interface deals with three N-gram levels, from unary to ternary, independently. The uni-gram mode is used for applications like T9, while the bi-gram and tri-gram modes are used for sentence prediction. The measures include the percentage of keystrokes saved, keystrokes until completion, and the percentage of time saved during typing. Two different corpora are merged to build a sufficient amount of data. The test data is divided equally into test and held-out sets for experimental purposes. This whole exercise enables the QASKU system to outperform FastType with almost 15% more saved keystrokes.
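The bi-gram prediction mode and the keystrokes-saved measure described in the abstract above can be sketched as follows. This is a toy illustration on an English phrase, not the paper's Urdu system, and the corpus is made up.

```python
from collections import defaultdict, Counter

# Sketch: bi-gram next-word prediction with a keystrokes-saved measure.
corpus = "the cat sat on the mat the cat ran".split()

# Count, for each word, how often each follower occurs.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict(prev_word):
    """Most frequent follower of prev_word in the training corpus."""
    followers = bigrams.get(prev_word)
    return followers.most_common(1)[0][0] if followers else None

# If the user accepts the predicted word with a single keystroke, the
# saving for that word is len(word) - 1 keystrokes.
word = predict("the")
saved = len(word) - 1
```

The tri-gram mode conditions on the previous two words instead of one; the keystrokes-until-completion measure follows the same counting idea per character.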
... The performance is evaluated in terms of runtime, number of features selected [28], and the accuracy produced with various classifiers such as Naïve Bayes, k-NN, and J48. ...
Article
Full-text available
Technological growth generates massive data in all fields. Classifying these high-dimensional data is a challenging task among researchers. The high dimensionality is reduced by a technique known as attribute reduction or feature selection. This paper proposes a genetic algorithm (GA)-based feature selection to improve the accuracy of medical data classification. The main purpose of the proposed method is to select the significant feature subset which gives higher classification accuracy with different classifiers. The proposed genetic algorithm-based feature selection removes the irrelevant features and selects the relevant features from the original dataset in order to improve the performance of the classifiers in terms of time to build the model, reduced dimension, and increased accuracy. The proposed method is implemented using MATLAB and tested on medical datasets with various classifiers, namely Naïve Bayes, J48, and k-NN, and the results show that the proposed method outperforms the other methods compared.
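GA-based feature selection of the kind the abstract above describes can be sketched with bit-mask individuals. This is a minimal illustration, not the paper's MATLAB implementation: the fitness function here is a synthetic stand-in for classifier accuracy that rewards an assumed "relevant" subset (features 0, 3, 7) and lightly penalizes subset size.

```python
import random

random.seed(0)

# Sketch: genetic algorithm over feature subsets encoded as bit masks.
N_FEATURES, POP, GENS = 10, 20, 40
RELEVANT = {0, 3, 7}   # assumption for the toy fitness, not real data

def fitness(mask):
    hits = sum(1 for i in RELEVANT if mask[i])
    extras = sum(mask) - hits
    return hits - 0.1 * extras   # accuracy proxy minus a size penalty

def crossover(a, b):
    cut = random.randrange(1, N_FEATURES)   # one-point crossover
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in mask]  # flip bits

pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]   # truncation selection, elitist
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)
selected = [i for i, bit in enumerate(best) if bit]
```

In the real setting, fitness would be the cross-validated accuracy of a classifier (Naïve Bayes, J48, or k-NN) trained on the columns the mask keeps, which is what makes the search expensive.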
Conference Paper
Full-text available
Web spam is one of the main problems of search engines because it reduces the quality of web pages. Web spam also has economic effects, because spammers/attackers place a large amount of free advertising data or websites on search engines, which results in a rise in web traffic. There are certain ways to tell apart such spam pages, and one of them is using classification techniques. A comparative examination of web spam detection using classification algorithms such as Random Forest, LAD Tree, C4.5, and Naive Bayes is presented in this paper. Analyses were completed on feature sets of the well-known dataset WEBSPAM-UK2007 using WEKA. When classification was performed without feature selection, some classifiers scored high on false positive rate and time taken to build the model, but when feature selection was applied to the datasets, results were optimized and Random Forest outperformed on all the datasets in all parameters that were selected.
Article
Spam is a basic problem in electronic communications such as large-scale email systems and the large number of weblogs and social networks. Due to the problems created by spam, much research has been carried out in this regard using classification techniques. Redundant and high-dimensional information is a serious problem for these classification algorithms due to high computation costs and memory usage. Reducing the feature space results in a more understandable model and allows various methods to be used. In this paper, a method of feature selection using the imperialist competitive algorithm is presented. Decision tree and SVM classifiers are used in the classification phase. In order to prove the efficiency of this method, the results of evaluating the Spambase dataset have been compared with algorithms proposed in this regard, such as the genetic algorithm. The results show that this method improves the efficiency of spam detection.
Article
Full-text available
Feature selection plays an important role in pattern recognition. A better selection of the feature set usually results in better performance in a classification problem. This work tries to select the best feature set for classification of rice varieties based on images of bulk samples using the imperialist competition algorithm. The imperialist competition algorithm is a new evolutionary optimization method inspired by imperialist competition. Results showed that the feature set selected by the imperialist competition algorithm provides better classification performance than that obtained by the genetic algorithm technique.
Article
Full-text available
In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to the classification accuracy. We realize that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows: (1) We collect and handle a large number of features based on recent advances in Web spam filtering. (2) We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy. (3) We conclude that, with appropriate learning techniques, a small and computationally inexpensive feature subset outperforms all previous results published so far on our data set and can only slightly be further improved by computationally expensive features. (4) We test our method on two major publicly available data sets, the Web Spam Challenge 2008 data set WEBSPAM-UK2007 and the ECML/PKDD Discovery Challenge data set DC2010. Our classifier ensemble reaches an improvement of 5% in AUC over the Web Spam Challenge 2008 best result; more importantly, our improvement is 3.5% based solely on fewer than 100 inexpensive content features and 5% if a small-vocabulary bag-of-words representation is included. For DC2010 we improve over the best achieved NDCG for spam by 7.5%, and by over 5% using inexpensive content features and a small bag-of-words representation.
Conference Paper
Full-text available
Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam detection techniques are usually designed for specific known types of Web spam and are incapable and inefficient for newly-appeared spam. With user behavior analyses into Web access logs, we propose a spam page detection algorithm based on Bayesian Learning. The main contributions of our work are: (1) User visiting patterns of spam pages are studied and three user behavior features are proposed to separate Web spam from ordinary ones. (2) A novel spam detection framework is proposed that can detect unknown spam types and newly-appeared spam with the help of user behavior analysis. Preliminary experiments on large scale Web access log data (containing over 2.74 billion user clicks) show the effectiveness of the proposed features and detection framework.
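Bayesian learning over binary user-behavior features, as the abstract above describes, can be sketched with a small Bernoulli Naive Bayes classifier. The three behavior features and the training vectors below are illustrative assumptions, not the paper's actual features or log data.

```python
import math

# Sketch: Bernoulli Naive Bayes over binary user-behavior features.
# Each row: ([feature vector], label); the data here is made up.
train = [([1, 1, 0], "spam"), ([1, 0, 1], "spam"),
         ([0, 0, 1], "ham"),  ([0, 1, 0], "ham")]

def fit(data, n_features=3):
    model = {}
    for label in {y for _, y in data}:
        rows = [x for x, y in data if y == label]
        prior = len(rows) / len(data)
        # Laplace-smoothed P(feature_i = 1 | label)
        probs = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                 for i in range(n_features)]
        model[label] = (prior, probs)
    return model

def predict(model, x):
    def log_post(label):
        prior, probs = model[label]
        return math.log(prior) + sum(
            math.log(p if xi else 1 - p) for p, xi in zip(probs, x))
    return max(model, key=log_post)

model = fit(train)
label = predict(model, [1, 1, 1])
```

Because the model only needs per-feature counts per class, it scales to the billions of user clicks the paper analyzes, and unseen spam types are flagged whenever their behavior profile is more probable under the spam class.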
Book
The rapid growth of the Web in the last decade makes it the largest publicly accessible data source in the world. Web mining aims to discover useful information or knowledge from Web hyperlinks, page contents, and usage logs. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three main types: Web structure mining, Web content mining and Web usage mining. Web structure mining discovers knowledge from hyperlinks, which represent the structure of the Web. Web content mining extracts useful information/knowledge from Web page contents. Web usage mining mines user access patterns from usage logs, which record clicks made by every user. The goal of this book is to present these tasks, and their core mining algorithms. The book is intended to be a text with a comprehensive coverage, and yet, for each topic, sufficient details are given so that readers can gain a reasonably complete knowledge of its algorithms or techniques without referring to any external materials. Four of the chapters, structured data extraction, information integration, opinion mining, and Web usage mining, make this book unique. These topics are not covered by existing books, but yet they are essential to Web data mining. Traditional Web mining topics such as search, crawling and resource discovery, and link analysis are also covered in detail in this book.
Book
This is the third edition of the premier professional reference on the subject of data mining, expanding and updating the previous market-leading edition. Like the first and second editions, Data Mining: Concepts and Techniques, 3rd Edition equips professionals with a sound understanding of data mining principles and teaches proven methods for knowledge discovery in large corporate databases, combining sound theory with truly practical applications to prepare students for real-world challenges in data mining. The first and second editions established themselves as the market leader for courses in data mining, data analytics, and knowledge discovery. Revisions incorporate input from instructors, changes in the field, and new and important topics such as data warehouse and data cube technology, mining stream data, mining social networks, and mining spatial, multimedia and other complex data. This book begins with a conceptual introduction followed by a comprehensive and state-of-the-art coverage of concepts and techniques. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. Wherever possible, the authors raise and answer questions of utility, feasibility, optimization, and scalability. Highlights include: a comprehensive, practical look at the concepts and techniques needed to get the most out of real business data; updates that incorporate input from readers, changes in the field, and more material on statistics and machine learning; scores of algorithms and implementation examples, all in easily understood pseudo-code and suitable for use in real-world, large-scale data mining projects; and complete classroom support for instructors as well as bonus content available at the companion website.
A comprehensive and practical look at the concepts and techniques you need in the area of data mining and knowledge discovery.
Article
Link-based analysis of the Web provides the basis for many important applications, like Web search, Web-based data mining, and Web page categorization, that bring order to the massive amount of distributed Web content. Due to the overwhelming reliance on these important applications, there is a rise in efforts to manipulate (or spam) the link structure of the Web. In this manuscript, we present a parameterized framework for link analysis of the Web that promotes spam resilience through a source-centric view of the Web. We provide a rigorous study of the set of critical parameters that can impact source-centric link analysis and propose the novel notion of influence throttling for countering the influence of link-based manipulation. Through formal analysis and a large-scale experimental study, we show how different parameter settings may impact the time complexity, stability, and spam resilience of Web link analysis. Concretely, we find that the source-centric model supports more effective and robust rankings in comparison with existing Web algorithms such as PageRank.
Conference Paper
This paper proposes an algorithm for optimization inspired by imperialistic competition. Like other evolutionary algorithms, the proposed algorithm starts with an initial population. Population individuals, called countries, are of two types: colonies and imperialists, which together form some empires. Imperialistic competition among these empires forms the basis of the proposed evolutionary algorithm. During this competition, weak empires collapse and powerful ones take possession of their colonies. Imperialistic competition hopefully converges to a state in which there exists only one empire and its colonies are in the same position and have the same cost as the imperialist. Applying the proposed algorithm to some benchmark cost functions shows its ability to deal with different types of optimization problems.
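The country/empire structure described in the abstract above can be sketched in a few lines. This is a stripped-down illustration on the sphere benchmark function: it keeps initialization, assimilation (colonies moving toward their imperialist), and the imperialist/colony exchange, while the imperialistic-competition step (weak empires losing colonies) is omitted for brevity. Parameter values are illustrative, not from the paper.

```python
import random

random.seed(1)

# Minimal sketch of the imperialist competitive algorithm (ICA) on the
# sphere cost function f(x) = sum(x_i^2).
DIM, N_COUNTRIES, N_IMP, ITERS, BETA = 2, 30, 3, 100, 2.0

def cost(x):
    return sum(v * v for v in x)

countries = [[random.uniform(-5, 5) for _ in range(DIM)]
             for _ in range(N_COUNTRIES)]
countries.sort(key=cost)
imperialists = countries[:N_IMP]       # strongest countries become imperialists
colonies = countries[N_IMP:]
# Round-robin assignment of colonies to empires (the paper assigns them
# in proportion to imperialist power).
empires = [[colonies[i] for i in range(e, len(colonies), N_IMP)]
           for e in range(N_IMP)]
initial_best = cost(imperialists[0])

for _ in range(ITERS):
    for e, imp in enumerate(imperialists):
        for c, col in enumerate(empires[e]):
            # Assimilation: move each colony toward its imperialist.
            new = [ci + BETA * random.random() * (ii - ci)
                   for ci, ii in zip(col, imp)]
            empires[e][c] = new
            # Exchange: a colony that beats its imperialist takes its place.
            if cost(new) < cost(imp):
                imperialists[e], empires[e][c] = new, imp
                imp = imperialists[e]

best = min(imperialists, key=cost)
```

Because an imperialist is only ever replaced by a cheaper colony, each empire's best cost is non-increasing, so the final best imperialist is at least as good as the initial one.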