The Impact of Feature Selection on Web Spam Detection

August 2012
International Journal of Intelligent Systems and Applications 4(9)

DOI:10.5815/ijisa.2012.09.08

I.J. Intelligent Systems and Applications, 2012, 9, 61-67

Published Online August 2012 in MECS (http://www.mecs-press.org/)

The Impact of Feature Selection on Web Spam Detection

Jaber Karimpour

Dept. of Computer Science, University of Tabriz, Tabriz, Iran

karimpour@tabrizu.ac.ir

Ali A. Noroozi

Dept. of Computer Science, University of Tabriz, Tabriz, Iran

aliasghar.noroozi@gmail.com

Adeleh Abadi

Dept. of Computer Science, University of Tabriz, Tabriz, Iran

adeleh.abadi@gmail.com

Abstract— Search engines are among the most important tools for managing the massive amount of distributed web content. Web spamming tries to deceive search engines into ranking some pages higher than they deserve. Many methods have been proposed to combat web spamming and to detect spam pages. A basic one is classification, i.e., learning a model that classifies web pages as spam or non-spam. This work tries to select the best feature set for web spam classification using the imperialist competitive algorithm (ICA) and the genetic algorithm (GA). The imperialist competitive algorithm is a novel optimization algorithm inspired by the socio-political process of imperialism in the real world. Experiments on the WEBSPAM-UK2007 data set show that feature selection improves classification accuracy and that the imperialist competitive algorithm outperforms GA.

Index Terms— Web Spam Detection, Feature Selection, Imperialistic Competitive Algorithm, Genetic Algorithm

I. Introduction

With the explosive growth of information on the web, it has become the largest and most successful distributed computing application today. Billions of web pages are shared by millions of organizations, universities, researchers, etc. Web search provides great functionality for distributing, sharing, organizing, and retrieving the growing amount of information [1]. Search engines have become increasingly important and are used by millions of people to find the information they need. It has therefore become very important for a web page to be ranked high in the results of the major search engines. As a result, many techniques have been proposed to influence ranking and improve the rank of a page. Some of these techniques are legitimate and are called Search Engine Optimization (SEO) techniques, but others are not legitimate or ethical and try to deceive ranking algorithms. Such spamming techniques try to rank pages higher than they deserve [2].

Web spam refers to web content that gets a high rank in search engine results despite its low information value. Spamming not only misleads users, but also imposes time and space costs on search engine crawlers and indexers. That is why crawlers try to detect web spam pages and avoid processing and indexing them.

Content-based spamming methods tailor the contents of the text fields in HTML pages to make spam pages appear more relevant to some queries. This kind of spamming is also called term spamming. There are two main content spamming techniques, both of which simply create synthetic content containing spam terms: repeating some important terms and dumping many unrelated terms [3,4].

Link spamming misuses the link structure of the web to promote spam pages. There are two main kinds of link spamming. Out-link spamming tries to boost the hub score of a page by adding out-links that point to authoritative pages. A common technique of this kind is directory cloning, i.e., replicating a large portion of a directory such as Yahoo! in the spam page. In-link spamming refers to persuading other pages, especially authoritative ones, to point to the spam page. To do this, a spammer might adopt the following strategies: creating a honey pot, infiltrating a web directory, posting links on user-generated content, participating in link exchange, buying expired domains, and creating their own spam farm [2].

Hiding techniques are also used by spammers who want to conceal spam sentences, terms, and links so that web users do not see them [3]. Content hiding makes spam items invisible. One simple method is to give the spam terms the same color as the page background. In cloaking, spam web servers return one HTML document to the user and a different document to a web crawler. In redirecting, a spammer hides the spammed page by automatically redirecting the browser to another URL as soon as the page is loaded. With the latter two techniques, the spammer can present the user with the intended content and the search engine with spam content [5].

Various methods have been proposed to combat web spamming and to detect spam pages. One important and basic class of methods treats web spam detection as a binary classification problem [4]. In these methods, some web pages are collected as training data and labeled as spam or non-spam by an expert. Then, a classifier model is learned from the training data. Any supervised learning algorithm can be used to build this model. The model is then used to classify any web page as spam or non-spam. The key issue is the design of the features used in learning. Ntoulas et al. [4] propose content-based features to detect content spam. Link-based features are proposed for link spam detection [6,7]. Liu et al. [8] propose user behavior features extracted from the access logs of the web server hosting a page. These features capture user behavior patterns when reaching a page (spam or non-spam). The patterns are used to separate spam pages from non-spam ones, regardless of the spamming techniques used. Erdelyi et al. [9] investigate the tradeoff between feature generation and spam classification accuracy. They conclude that more features achieve better performance; however, an appropriate choice of machine learning technique for classification is probably more important than devising new complex features.

Feature selection is the process of finding an optimal subset of features that contribute significantly to the classification. Selecting a small subset of features can decrease the cost and the running time of a classification system. It may also increase the classification accuracy because irrelevant or redundant features are removed [10]. Among the many methods proposed for feature selection, evolutionary optimization algorithms such as the genetic algorithm (GA) have gained a lot of attention. The genetic algorithm has been used as an efficient feature selection method in many applications [11,16].

In this paper, we use the genetic algorithm and the imperialist competitive algorithm [12] to find an optimal subset of features of the WEBSPAM-UK2007 data set [13,14]. The selected features are then used to classify the WEBSPAM-UK2007 data.

The rest of the paper is organized as follows. Section 2 gives a brief introduction to the imperialist competitive algorithm (ICA). Section 3 describes the feature selection process by ICA and GA. Experimental results are discussed in section 4, and finally, section 5 concludes the paper.

II. Imperialistic Competitive Algorithm

The imperialist competitive algorithm is inspired by

imperialism in the real world [12]. Imperialism is the

policy of extending the power of a country beyond its

boundaries and weakening other countries to take

control of them.

Fig. 1 Initialization of the empires: The more colonies an imperialist possesses, the bigger its mark is [12]

This algorithm starts with an initial society of randomly generated countries. Some of the best countries are selected as imperialists and the others become colonies of these imperialists. The power of an empire, which is the counterpart of the fitness value in genetic algorithms, is the power of the imperialist country plus a percentage of the mean power of its colonies. Figure 1 depicts the initialization of the empires.

After all countries have been assigned to imperialists and the empires have been formed, colonies start moving towards their relevant imperialist (Assimilation). Then, some countries randomly change position in the search space (Revolution). After assimilation and revolution, a colony may reach a better position in the search space and take control of the empire (substituting for the imperialist).

Fig. 2 Imperialistic competition: The weakest colony of the weakest empire is possessed by other empires [12]

Then, imperialistic competition begins. All empires try to take control of the weakest colony of the weakest empire. This competition reduces the power of weaker empires and increases the power of the more powerful ones. Any empire that cannot compete with other empires and increase its power, or at least prevent it from decreasing, will gradually collapse. As a result, after some iterations, the algorithm converges: only one imperialist remains and all other countries are its colonies. Figure 2 depicts the imperialistic competition. The more powerful an empire is, the more likely it is to take control of the weakest colony of the weakest empire.

The pseudo code of ICA is as follows:

1. Initialize the empires
2. Assimilation: Move the colonies toward their relevant imperialist
3. Revolution: Randomly change the characteristics of some colonies
4. Exchange the positions of a colony and the imperialist: If a colony has more power than its imperialist, exchange the positions of that colony and the imperialist
5. Compute the total power of all empires
6. Imperialistic competition: Give the weakest colony of the weakest empire to the empire that is most likely to possess it
7. Eliminate the powerless empires
8. If there is just one empire, stop; else, go to 2

III. Feature Selection

The WEBSPAM-UK2007 data set contains 96 content-based features. We use the imperialist competitive and genetic algorithms to select the subset of features that contributes most to the classification.

A. Feature Selection Using ICA

In this section, the steps of feature selection using

ICA are described.

1) Initialize the empires

In the genetic algorithm, each solution to an optimization problem is an array called a chromosome. In ICA, this array is called a country. In feature selection, each country is an array of binary numbers. When country[i] is 1, the ith feature is selected for classification, and when it is 0, the ith feature is removed [15]. Figure 3 depicts the feature representation as a country.

[Figure: a country drawn as a binary array over features 1, …, n; the cells set to 1 give the feature subset $\{F_1, F_3, \ldots, F_{n-1}\}$]

Fig. 3 Feature representation as a country in ICA [15]
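As an illustration, the binary representation can be sketched in a few lines of Python. The function names `selected_features` and `project` and the six-feature example are hypothetical and only serve to show how a country acts as a mask over the feature vector.

```python
def selected_features(country):
    """Return the 1-based indices of the features whose cell is 1."""
    return [i + 1 for i, bit in enumerate(country) if bit == 1]


def project(row, country):
    """Keep only the values of one data row whose feature is selected."""
    return [value for value, bit in zip(row, country) if bit == 1]


# Hypothetical 6-feature example (the real data set has 96 features).
country = [1, 0, 1, 0, 1, 0]
row = [0.2, 5.0, 1.3, 7.1, 0.9, 3.3]

print(selected_features(country))  # [1, 3, 5]
print(project(row, country))       # [0.2, 1.3, 0.9]
```

A classifier trained for a given country would only see the projected rows, so two countries with the same bits always receive the same power.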

The power of each country is calculated by F-score. F-score is a commonly used measure in machine learning and information retrieval [3,10]. The confusion matrix of a given classifier is shown in Table 1.

Table 1. Confusion Matrix

| | Classified spam | Classified non-spam |
|---|---|---|
| Actual spam | D | C |
| Actual non-spam | B | A |

F-score is determined as follows

$$\text{F-score} = \frac{2}{\dfrac{1}{\text{Recall}} + \dfrac{1}{\text{Precision}}} \tag{1}$$

where Recall and Precision are defined as follows

$$\text{Recall} = \frac{D}{C + D} \tag{2}$$

$$\text{Precision} = \frac{D}{B + D} \tag{3}$$
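The computation can be sketched as follows, assuming the standard harmonic-mean form of the F-score for the spam class. The function name `precision_recall_f`, the argument names, and the counts in the example are hypothetical and are not taken from the experiments.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-score of the spam class from raw counts.

    tp: spam pages classified as spam
    fp: non-spam pages classified as spam
    fn: spam pages classified as non-spam
    """
    if tp == 0:
        # Assumption: when no spam page is found, all three measures are 0.
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 / (1 / recall + 1 / precision)  # harmonic mean of the two
    return precision, recall, f_score


# Hypothetical counts, not taken from the paper's experiments.
precision, recall, f = precision_recall_f(tp=8, fp=2, fn=2)
print(precision, recall, round(f, 3))
```

Because the harmonic mean is dominated by the smaller of its two arguments, a subset of features only receives a high power when both precision and recall are high.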

The algorithm starts by randomly initializing a population of size $N_{pop}$. $N_{imp}$ of the most powerful countries are selected as imperialists and form the empires. The remaining countries ($N_{col}$) are assigned to empires based on the power of each empire. The normalized power of each imperialist is defined by

$$NP_n = \frac{P_n}{\sum_{i=1}^{N_{imp}} P_i} \tag{4}$$

where $P_n$ is the power of country $n$. The initial number of colonies of empire $n$ will be

$$NC_n = \mathrm{round}\{NP_n \cdot N_{col}\} \tag{5}$$

To assign colonies to empires, $NC_n$ of the colonies are chosen randomly and assigned to imperialist $n$. These colonies, along with imperialist $n$, will form empire $n$.
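The initialization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `init_empires` and the example powers are hypothetical, and letting the last empire absorb the colonies left over by rounding is an assumption, since the text does not say how rounding mismatches are resolved.

```python
import random


def init_empires(powers, n_imp, rng):
    """Split countries into empires: {imperialist index: [colony indices]}."""
    # Country indices sorted from most to least powerful.
    order = sorted(range(len(powers)), key=lambda i: powers[i], reverse=True)
    imperialists, colonies = order[:n_imp], order[n_imp:]
    n_col = len(colonies)

    # Normalized power of each imperialist: its share of the imperialists' power.
    total = sum(powers[i] for i in imperialists)
    norm_power = [powers[i] / total for i in imperialists]

    # Colonies are handed out in random order.
    rng.shuffle(colonies)

    empires, start = {}, 0
    for k, imp in enumerate(imperialists):
        if k == n_imp - 1:
            end = n_col  # assumption: last empire absorbs rounding leftovers
        else:
            # Number of colonies: rounded share of all colonies.
            end = min(start + round(norm_power[k] * n_col), n_col)
        empires[imp] = colonies[start:end]
        start = end
    return empires


# Hypothetical powers (e.g. F-scores) of 8 randomly generated countries.
powers = [0.90, 0.80, 0.50, 0.40, 0.30, 0.20, 0.10, 0.60]
empires = init_empires(powers, n_imp=2, rng=random.Random(0))
for imp, cols in empires.items():
    print(f"imperialist {imp}: colonies {sorted(cols)}")
```

With these powers, the two strongest countries (indices 0 and 1) become imperialists and the six remaining countries are divided between them in proportion to their normalized power.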

2) Assimilation

In this phase, colonies move towards their relevant imperialist. Since feature selection is a discrete problem, we use the following operator for assimilation [15]:

For each colony

• Create a binary string and assign a randomly generated bit to each cell

• Copy the cells of the relevant imperialist that correspond to the locations of the "1"s in the binary string to the same positions in the colony

3) Revolution

The purpose of revolution is to preserve and introduce diversity. It allows the algorithm to avoid local optima. Revolution occurs according to a user-defined revolution probability. For each colony, some cells are selected randomly and their bits are inverted ("1" is inverted to "0", and "0" is inverted to "1").
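A minimal sketch of revolution is given below. It assumes one possible reading of the description, namely that every cell of a colony is inverted independently with the revolution probability; the function name `revolve` is hypothetical.

```python
import random


def revolve(colony, revolution_rate, rng):
    """Invert each cell independently with probability revolution_rate."""
    return [1 - bit if rng.random() < revolution_rate else bit
            for bit in colony]


colony = [1, 0, 1, 1, 0, 0, 1, 0]
# 0.01 is the revolution rate used in the experiments; 0.5 is only for display.
print(revolve(colony, 0.01, random.Random(0)))
print(revolve(colony, 0.5, random.Random(0)))
```

With a small rate most colonies are left untouched in a given iteration, which keeps the search close to the imperialists while still allowing occasional jumps.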

4) Exchange the positions of a colony and imperialist

After assimilation and revolution, a colony may gain more power than its imperialist. In that case, the best colony of the empire and its imperialist exchange positions. The algorithm then continues with the imperialist in its new position.
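The exchange step can be sketched as follows. The function name `exchange_positions` is hypothetical, and the toy power function (the number of selected features) merely stands in for the F-score of a trained classifier.

```python
def exchange_positions(imperialist, colonies, power):
    """Swap the imperialist with its best colony if that colony is stronger."""
    if not colonies:
        return imperialist, colonies
    best = max(range(len(colonies)), key=lambda i: power(colonies[i]))
    if power(colonies[best]) > power(imperialist):
        colonies = list(colonies)  # do not modify the caller's list
        imperialist, colonies[best] = colonies[best], imperialist
    return imperialist, colonies


# Toy power function for illustration only: number of selected features.
toy_power = sum

imperialist, colonies = exchange_positions(
    [0, 0, 1], [[1, 1, 1], [0, 1, 0]], toy_power)
print(imperialist)  # [1, 1, 1]
print(colonies)     # [[0, 0, 1], [0, 1, 0]]
```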

5) Compute the total power of empires

The total power of an empire is mainly determined by the power of its imperialist. The power of the colonies of that empire also contributes, but with less impact. As a result, the total power of empire $n$ is defined as

$$TP_n = \mathrm{power}(\text{imperialist}_n) + \xi \, \mathrm{mean}\{\mathrm{power}(\text{colonies of empire}_n)\} \tag{6}$$

where $\xi$ is a positive factor which is considered to be less than 1. Decreasing the value of $\xi$ increases the role of the imperialist in determining the total power of an empire, and increasing it increases the role of the colonies.
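As a small worked example, the total power can be computed as below. The function name `total_power`, the argument name `xi` for the colony weighting factor, and the example powers are hypothetical; returning the imperialist's power for an empire without colonies is an assumption.

```python
def total_power(imperialist_power, colony_powers, xi=0.1):
    """Imperialist power plus xi times the mean power of the colonies."""
    if not colony_powers:
        # Assumption: an empire without colonies has only its imperialist's power.
        return imperialist_power
    mean_colonies = sum(colony_powers) / len(colony_powers)
    return imperialist_power + xi * mean_colonies


# Hypothetical powers; 0.1 mirrors the factor listed in the experimental parameters.
print(total_power(0.90, [0.50, 0.70]))          # about 0.96
print(total_power(0.90, [0.50, 0.70], xi=0.5))  # about 1.20
```

The second call shows how a larger factor lets the colonies contribute more to the total power of the empire.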

6) Imperialistic Competition

In this important phase of the algorithm, the empires compete to take control of the weakest colony of the weakest empire. Each empire has a likelihood of possessing this colony. The possession probability of empire $n$ is obtained by

$$P_{emp_n} = \frac{TP_n}{\sum_{i=1}^{N_{imp}} TP_i} \tag{7}$$

where $TP_n$ is the total power of empire $n$ and the sum is taken over the $N_{imp}$ empires. Note that the most powerful empire does not necessarily take possession of the weakest colony of the weakest empire, but it is the most likely to do so.
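One way to realize this probabilistic choice is a roulette-wheel draw, sketched below. The function names and the total powers in the example are hypothetical, and the assumption is that the possession probability of an empire is its total power divided by the sum of the total powers of all empires.

```python
import random


def possession_probabilities(total_powers):
    """Possession probability of each empire, proportional to its total power."""
    total = sum(total_powers)
    return [p / total for p in total_powers]


def pick_winner(total_powers, rng):
    """Draw the index of the empire that takes the weakest colony."""
    return rng.choices(range(len(total_powers)), weights=total_powers, k=1)[0]


# Hypothetical total powers of three empires.
total_powers = [0.96, 0.80, 0.64]
print(possession_probabilities(total_powers))
print(pick_winner(total_powers, random.Random(7)))
```

Over many draws the strongest empire wins most often, but weaker empires still have a chance, which matches the behaviour described above.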

7) Eliminate the powerless empires

Imperialistic competition causes some empires to lose power and gradually collapse. When an empire loses all its colonies, we assume it has collapsed and eliminate it. The imperialist of this powerless empire is then possessed by another empire as a colony.

8) Convergence

As a result of imperialistic competition and

elimination of powerless empires, the algorithm will

converge to the most powerful empire and all the

countries will be under the control of this empire. The

imperialist of this empire will determine the optimal

subset of features selected for classification, because

this imperialist is the most powerful of all countries.

B. Feature selection using GA

In the genetic algorithm, each solution to the feature selection problem is a string of binary numbers called a chromosome. When chromosome[i] is 1, the ith feature is selected for classification, and when it is 0, the ith feature is not selected [11,16].

The fitness function measures the quality of the classification model. In this research, we calculate the fitness value of each chromosome by F-score, which was described in the previous section.

The algorithm starts by randomly initializing a population of size $N_{pop}$. Then, crossover and mutation are performed.

Crossover generates new chromosomes by combining the current best chromosomes. We use single-point crossover, i.e., one crossover point is selected, the binary string from the beginning of the chromosome to the crossover point is copied from one parent, and the rest is copied from the second parent. Figure 4 shows how children are generated from each pair of chromosomes by crossover.
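Single-point crossover can be sketched as follows. The function name `single_point_crossover` and the parent strings are hypothetical; the crossover point is passed in explicitly so that the example is deterministic, whereas in the algorithm it would be drawn at random.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Return two children that swap the parents' tails after `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


parent_a = [1, 1, 1, 1, 1, 1]
parent_b = [0, 0, 0, 0, 0, 0]
child_a, child_b = single_point_crossover(parent_a, parent_b, point=2)
print(child_a)  # [1, 1, 0, 0, 0, 0]
print(child_b)  # [0, 0, 1, 1, 1, 1]
```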

Mutation is similar to revolution in ICA. It maintains genetic diversity and allows the algorithm to avoid local optima. To perform mutation, a random cell is selected in each chromosome and its bit is inverted ("1" is inverted to "0", and "0" is inverted to "1").
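A minimal sketch of this mutation operator is shown below. It assumes that the mutation probability decides whether a chromosome is mutated at all and that a mutated chromosome has exactly one randomly chosen bit inverted; the function name `mutate` is hypothetical.

```python
import random


def mutate(chromosome, mutation_rate, rng):
    """With probability mutation_rate, invert one randomly chosen bit."""
    child = list(chromosome)  # leave the parent unchanged
    if rng.random() < mutation_rate:
        i = rng.randrange(len(child))
        child[i] = 1 - child[i]
    return child


chromosome = [1, 0, 1, 1, 0, 0]
# 0.01 is the mutation rate used in the experiments; 1.0 forces a mutation.
print(mutate(chromosome, 0.01, random.Random(0)))
print(mutate(chromosome, 1.0, random.Random(0)))
```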

Mutation and crossover occur according to predefined mutation and crossover probabilities. The genetic algorithm iterates for a user-defined number of generations.

Fig. 4 How children are generated from parents by crossover [17]

IV. Experimental Results

In order to investigate the impact of feature selection on web spam classification, the WEBSPAM-UK2007 data set is used. It is a publicly available web spam collection based on a crawl of the .uk domain done in May 2007 [13, 14]. It includes 105 million pages and over 3 billion links in 114529 hosts.

The training set contains 3849 hosts. This data set contains content-based and link-based features. In our experiments, we used only the content-based features because they were sufficient for our purposes. The selected data set contains 3849 instances, with 208 spam and 3641 non-spam pages. We partitioned this data set into two disjoint sets: a training set with 2449 instances and a test set with 1000 instances. After performing feature selection using the training set, the test set was used to evaluate the selected subset of features.

The evaluation of the overall process was based on the weighted F-score, which is a suitable measure for the spam classification problem. It was also used as the power function in ICA and the fitness function in GA.
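The exact weighting is not spelled out here. One common definition, assumed in the sketch below, averages the per-class F-scores of the spam and non-spam classes, weighted by the number of instances in each class. The function names and the counts in the example are hypothetical.

```python
def class_f_score(tp, fp, fn):
    """F-score of one class from its true/false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f_score(tp, fp, fn, tn):
    """Support-weighted mean of the spam and non-spam F-scores (assumed definition)."""
    spam_support = tp + fn      # actual spam instances
    nonspam_support = tn + fp   # actual non-spam instances
    f_spam = class_f_score(tp, fp, fn)
    # For the non-spam class the roles of the counts are mirrored.
    f_nonspam = class_f_score(tn, fn, fp)
    total = spam_support + nonspam_support
    return (spam_support * f_spam + nonspam_support * f_nonspam) / total


# Hypothetical counts: a classifier that labels every page as non-spam.
print(round(weighted_f_score(tp=0, fp=0, fn=5, tn=95), 3))
```

Under this definition the majority class dominates the score on an imbalanced data set, so even a classifier that never detects spam obtains a high value, which is worth keeping in mind when comparing scores.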

Bayesian Network, Decision Tree (C4.5 algorithm), and Support Vector Machine (SVM) were chosen as learning algorithms to perform the classification and calculate the weighted F-score. These are powerful learning algorithms used in many web spam detection studies [4, 5, 18].

The following parameters were used for ICA:

Number of countries = 100

Number of imperialists = 10

$\xi$ = 0.1

Revolution rate = 0.01

The parameters selected for GA are as follows:

Initial population = 100

Number of generations (iterations) = 100

Crossover rate = 0.6

Mutation rate = 0.01

Figure 5 depicts the maximum and mean power of all imperialists versus iteration in ICA, using the Decision Tree, SVM, and Bayesian Network classifiers. As shown in this figure, with the SVM and Decision Tree classifiers, the global maximum of the function (maximum power) is found in fewer than 5 iterations, while with Bayesian Network, it is found in the 12th iteration.

Fig. 5 Mean and maximum power of all imperialists versus iteration, using different classifiers, in ICA

Figures 6 and 7 compare the ICA power function and the GA fitness function versus iteration. Figure 6 shows the power (fitness) of the best answer versus iteration (generation) in ICA and GA, using the Bayesian Network classifier. ICA converges faster than GA and has more power than GA in all iterations. Another important point is that the initial F-score, which results from the random initialization of the population in both algorithms, increases more under ICA over the iterations. This suggests that imperialistic competition outperforms genetic evolution in the problem of spam classification.

Fig. 6 Power (fitness) of best answer versus iteration, using Bayesian Network classifier

Fig. 7 Mean power (fitness) of all answers versus iteration, using Bayesian Network classifier

Figure 7 depicts the mean power (fitness) of all answers versus iteration in ICA and GA, using the Bayesian Network classifier. ICA achieves a larger increase in the mean power of all answers.

The optimal subsets of features selected by ICA and GA are used to train classification models. These models are evaluated on the test data set. Evaluation results obtained for the Bayesian Network, Decision Tree, and SVM classifiers are shown in Table 2. These results indicate that feature selection by both ICA and GA improves web spam classification. Furthermore, ICA-based feature selection matches or outperforms GA-based feature selection for every classifier.

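Table 2 The impact of ICA and GA based feature selection on web spam classification, using different classifiers

| Feature selection | Bayesian Network: Number of features | Bayesian Network: F-score | Decision Tree: Number of features | Decision Tree: F-score | SVM: Number of features | SVM: F-score |
|---|---|---|---|---|---|---|
| All features | | 0.854 | | 0.935 | | 0.937 |
| GA | | 0.876 | | 0.950 | | 0.939 |
| ICA | | 0.882 | | 0.950 | | 0.940 |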

V. Conclusion

In this paper, we studied the impact of feature selection on the problem of web spam classification. Feature selection was performed by the Imperialist Competitive Algorithm and the Genetic Algorithm. Experimental results showed that selecting an optimal subset of features increases classification accuracy, and that ICA finds better solutions than GA. We also observed that reducing the number of features decreases the classification cost while increasing the classification accuracy.

Other optimization methods, such as PSO and ant colony optimization, can be used for feature selection and compared with ICA and GA in future work.

References

[1] Caverlee J, Liu L, Webb S. A Parameterized Approach to Spam-Resilient Link Analysis of the Web. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2009, 20:1422-1438.

[2] Gyongyi Z, Garcia-Molina H. Web spam taxonomy. In: First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb’05), Japan, 2005.

[3] Liu B. Web Data Mining, Exploring Hyperlinks, Contents, and Usage Data. Springer, 2006.

[4] Ntoulas A, Najork M, Manasse M, et al. Detecting Spam Web Pages through Content Analysis. In Proc. of the 15th Intl. World Wide Web Conference (WWW’06), 2006. 83–92.

[5] Wang W, Zeng G, Tang D. Using evidence based content trust model for spam detection. Expert Systems with Applications, 2010. 37(8):5599-5606.

[6] Becchetti L, Castillo C, Donato D, et al. Link-based characterization and detection of Web Spam. In Proc. of 2nd Int. Workshop on Adversarial Information Retrieval on the Web (AIRWeb’06), Seattle, WA, 2006. 1–8.

[7] Castillo C, Donato D, Gionis A, et al. Know your neighbors: Web spam detection using the web topology. In Proc. of 30th Annu. Int. ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR’07), New York, 2007. 423–430.

[8] Liu Y, Cen R, Zhang M, et al. Identifying web spam with user behavior analysis. In Proc. of 4th Int. Workshop on Adversarial Information Retrieval on the Web (AIRWeb’08), China, 2008. 9-16.

[9] Erdelyi M, Garzo A, Benczur A A. Web spam classification: a few features worth more. In Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, India, 2011. 27-34.

[10] Han J, Kamber M, Pei J. Data Mining, Concepts and Techniques. 3rd edn, Morgan Kaufmann, 2011.

[11] Vafaie H, De Jong K. Genetic algorithms as a tool for feature selection in machine learning. In Proceedings of Fourth International Conference on Tools with Artificial Intelligence (TAI '92), 1992. 200-203.

[12] Atashpaz-Gargari E, Lucas C. Imperialist competitive algorithm: An algorithm for optimization inspired by imperialistic competition. IEEE Congress on Evolutionary Computation (CEC 2007), 2007. 4661-4667.

[13] Castillo C, Donato D, Becchetti L, et al. A reference collection for webspam. SIGIR Forum, 2006, 40(2): 11–24.

[14] Yahoo Research. Web Spam Collections. [cited 2011 May], Available from: http://barcelona.research.yahoo.net/webspam/datasets/, 2007.

[15] Mousavi Rad S J, Mollazade K, Akhlagian Tab F. Application of Imperialist Competitive Algorithm for Feature Selection: A Case Study on Bulk Rice Classification. International Journal of Computer Applications, 2012. 40(16):41-48.

[16] Yang J, Honavar V. Feature subset selection using a genetic algorithm. Intelligent Systems and their Applications, IEEE, 1998. 13(2):44-49.

[17] Eiben A E, Smith J E. Introduction to Evolutionary Computing, Springer, 2010.

[18] Araujo L, Martinez-Romo J. Web Spam Detection: New Classification Features Based on Qualified Link Analysis and Language Models. IEEE Transactions on Information Forensics and Security, 2010. 5(3):581-590.

KARIMPOUR Jaber (1974－), male, Tabriz, Iran, Assistant Professor. His research interests include verification and formal methods.

NOROOZI Ali A. (1986－), male, Tabriz, Iran, Master of Science. His research interests include adversarial information retrieval and distributed systems.

ABADI Adeleh (1976－), female, Tabriz, Iran, Master of Science. Her research interests include verification and formal methods.

Comprehensive Evaluation of Machine Learning Techniques and Novel Features for Web Link Spamdexing Detection

Article

Dec 2014

World Wide Web (WWW) is a huge, dynamic, self-organized, and strongly interlinked source of information. Search engine became a vital IR (Information Retrieval) system to retrieve the required information. Results appearing in the first few pages gain more attraction and importance. Since users believe that they were more relevant because of its top positions. Spamdexing plays a key role in making high rank and top visibility for an undeserved page. This paper focus on two aspects: new features and new classifiers. First, 27 new features which are used to commercially boost the ranking and reputation are considered for classification. Along with them 17 new features were proposed and computed. Totally 44 features were combined with the existing WEBSPAM-UK 2007 dataset which is the baseline. With all these features, feature inclusion study is carried out to elevate the performance. Second aspect considered in this paper is exploring new suite of five different machine learners for the web spam classification problem. Results are discussed. New feature inclusion improves the classification accuracy of the publicly available WEBSPAM-UK 2007 features by 22%. SVM outperforms well than the other methods in terms of accuracy.

Attribute selection for improving spam classification in online social networks: a rough set theory-based approach

Article

Full-text available

Jan 2018

As online social network (OSN) sites become increasingly popular, they are targeted by spammers who post malicious content on the sites. Hence, it is important to filter out spam accounts and spam posts from OSNs. There exist several prior works on spam classification on OSNs, which utilize various features to distinguish between spam and legitimate entities. The objective of this study is to improve such spam classification, by developing an attribute selection methodology that helps to find a smaller subset of the attributes which leads to better classification. Specifically, we apply the concepts of rough set theory to develop the attribute selection algorithm. We perform experiments over five different spam classification datasets over diverse OSNs and compare the performance of the proposed methodology with that of several baseline methodologies for attribute selection. We find that, for most of the datasets, the proposed methodology selects an attribute subset that is smaller than what is selected by the baseline methodologies, yet achieves better classification performance compared to the other methods.

The Impact of Feature Selection Methods for Classifying Arabic Textual Data

Article

Full-text available

Nov 2019

Text classification is a vital process due to the large volume of electronic articles. One of the drawbacks of text classification is the high dimensionality of feature space. Scholars developed several algorithms to choose relevant features from article text such as Chi-square (x 2), Information Gain (IG), and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known algorithms: Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree against benchmark Arabic textual datasets, called Saudi Press Agency (SPA) to evaluate the impact of feature selection methods. Using the WEKA tool, we have experimented the application of the four mentioned classification algorithms with and without feature selection algorithms. The results provided clear evidence that the three feature selection methods often improves classification accuracy by eliminating irrelevant features.

NLSDF FOR BOOSTING THE RECITAL OF WEB SPAMDEXING CLASSIFICATION

Article

Full-text available

Oct 2016

Spamdexing is the art of black hat SEO. Features which are more influential for high rank and visibility are manipulated for the SEO task. The motivation behind the work is utilizing the state of the art Website optimization features to enhance the performance of spamdexing detection. Features which play a focal role in current SEO strategies show a significant deviation for spam and non spam samples. This paper proposes 44 features named as NLSDF (New Link Spamdexing Detection Features). Social media creates an impact in search engine results ranking. Features pertaining to the social media were incorporated with the NLSDF features to boost the recital of the spamdexing classification. The NLSDF features with 44 attributes along with 5 social media features boost the classification performance of the WEBSPAM-UK 2007 dataset. The one tailed paired t-test with 95% confidence, performed on the AUC values of the learning models shows significance of the NLSDF.

A Stochastic Prediction Interface for Urdu

Article

Full-text available

Dec 2014

Qaiser Abbas

This work lays down a foundation for text prediction of an inflected and under-resourced language Urdu. The interface developed is not limited to a T9 (Text on 9 keys) application used in embedded devices, which can only predict a word after typing initial characters. It is capable of predicting a word like T9 and also a sequence of word after a word in a continuous manner for fast document typing. It is based on N-gram language model. This stochastic interface deals with three N-gram levels from unary to ternary independently. The uni-gram mode is being in use for applications like T9, while the bi-gram and tri-gram modes are being in use for sentence prediction. The measures include a percentage of keystrokes saved, keystrokes until completion and a percentage of time saved during the typing. Two different corpora are merged to build a sufficient amount of data. The test data is divided into a test and a held out data equally for an experimental purpose. This whole exercise enables the QASKU system outperforms the FastType with almost 15% more saved keystrokes.

Dimensionality Reduction Using Genetic Algorithm for Improving Accuracy in Medical Diagnosis

Article

Jan 2016

Technological growth generates massive amounts of data in all fields, and classifying such high-dimensional data is a challenging task for researchers. High dimensionality is reduced by a technique known as attribute reduction or feature selection. This paper proposes genetic algorithm (GA)-based feature selection to improve the accuracy of medical data classification. The main purpose of the proposed method is to select the significant feature subset that gives higher classification accuracy with different classifiers. The proposed GA-based feature selection removes irrelevant features and selects relevant features from the original dataset in order to improve classifier performance in terms of model-building time, reduced dimension, and increased accuracy. The proposed method is implemented in MATLAB and tested on a medical dataset with several classifiers, namely Naïve Bayes, J48, and k-NN; the results show that it outperforms the other methods compared.

Attribute selection to improve spam classification

Chapter

Jan 2023

Literature review on data analytics for social microblogging platforms

Chapter

Jan 2023

Webspam Detection Using Classification Algorithms and Optimizing the Performance of Classifiers by Selecting the Features

Conference Paper

Oct 2020

Web spam is one of the main problems for search engines because it reduces the quality of web pages. Web spam also has an economic effect, because spammers provide large amounts of free advertising data or sites on search engines, which increases web traffic. There are several ways to distinguish such spam pages, and one of them is classification. This paper presents a comparative study of web spam detection using classification algorithms such as Random Forest, LAD Tree, C4.5, and Naive Bayes. Experiments were carried out on feature sets of the well-known WEBSPAM-UK2007 dataset using WEKA. Without feature selection, some classifiers had a high false positive rate and a long model-building time; when feature selection was applied, results improved, and Random Forest outperformed the other classifiers on all datasets for all of the selected parameters.

Feature Selection by Using Discrete Imperialist Competitive Algorithm to Spam Detection

Article

Nov 2014

Spam is a basic problem in large-scale electronic communication, including email systems, weblogs, and social networks. Because of the problems spam creates, much research has addressed it using classification techniques. Redundant and high-dimensional information is a serious problem for these classification algorithms because of its high computation cost and memory use. Reducing the feature space yields a more understandable model and allows various methods to be used. This paper presents a feature selection method based on the imperialist competitive algorithm. Decision tree and SVM classifiers are used in the classification phase. To demonstrate the efficiency of the method, results on the Spambase data set are compared with those of previously proposed algorithms such as the genetic algorithm. The results show that this method improves the efficiency of spam detection.

Application of Imperialist Competitive Algorithm for Feature Selection: A Case Study on Bulk Rice Classification

Article

Feb 2012

Feature selection plays an important role in pattern recognition. A better choice of feature set usually results in better performance in a classification problem. This work tries to select the best feature set for classifying rice varieties from images of bulk samples, using the imperialist competition algorithm. The imperialist competition algorithm is a new evolutionary optimization method inspired by imperialist competition. Results showed that the feature set selected by the imperialist competition algorithm provides better classification performance than the set obtained by a genetic algorithm.

Web spam classification: A few features worth more

Article

Apr 2011

In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to the classification accuracy. We realize that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows:
• We collect and handle a large number of features based on recent advances in Web spam filtering.
• We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.
• We conclude that, with appropriate learning techniques, a small and computationally inexpensive feature subset outperforms all previous results published so far on our data set and can only slightly be further improved by computationally expensive features.
• We test our method on two major publicly available data sets, the Web Spam Challenge 2008 data set WEBSPAM-UK2007 and the ECML/PKDD Discovery Challenge data set DC2010.
Our classifier ensemble reaches an improvement of 5% in AUC over the Web Spam Challenge 2008 best result; more importantly, our improvement is 3.5% based solely on fewer than 100 inexpensive content features, and 5% if a small-vocabulary bag-of-words representation is included. For DC2010 we improve over the best achieved NDCG for spam by 7.5%, and by over 5% using inexpensive content features and a small bag-of-words representation.

Identifying web spam with user behavior analysis

Conference Paper

Apr 2008

Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam detection techniques are usually designed for specific known types of Web spam and are ineffective and inefficient against newly appeared spam. Using user behavior analysis of Web access logs, we propose a spam page detection algorithm based on Bayesian learning. The main contributions of our work are: (1) User visiting patterns of spam pages are studied, and three user behavior features are proposed to separate Web spam from ordinary pages. (2) A novel spam detection framework is proposed that can detect unknown spam types and newly appeared spam with the help of user behavior analysis. Preliminary experiments on large-scale Web access log data (containing over 2.74 billion user clicks) show the effectiveness of the proposed features and detection framework.

Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data

Book

Jan 2007

Bing Liu

The rapid growth of the Web in the last decade makes it the largest publicly accessible data source in the world. Web mining aims to discover useful information or knowledge from Web hyperlinks, page contents, and usage logs. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three main types: Web structure mining, Web content mining and Web usage mining. Web structure mining discovers knowledge from hyperlinks, which represent the structure of the Web. Web content mining extracts useful information/knowledge from Web page contents. Web usage mining mines user access patterns from usage logs, which record clicks made by every user. The goal of this book is to present these tasks, and their core mining algorithms. The book is intended to be a text with a comprehensive coverage, and yet, for each topic, sufficient details are given so that readers can gain a reasonably complete knowledge of its algorithms or techniques without referring to any external materials. Four of the chapters, structured data extraction, information integration, opinion mining, and Web usage mining, make this book unique. These topics are not covered by existing books, but yet they are essential to Web data mining. Traditional Web mining topics such as search, crawling and resource discovery, and link analysis are also covered in detail in this book.

Data Mining: Concepts and Techniques

Book

Jan 2012

This is the third edition of the premier professional reference on data mining, expanding and updating the previous market-leading edition. This was the first (and is still the best and most popular) of its kind. It combines sound theory with truly practical applications to prepare students for real-world challenges in data mining. Like the first and second editions, Data Mining: Concepts and Techniques, 3rd Edition equips professionals with a sound understanding of data mining principles and teaches proven methods for knowledge discovery in large corporate databases. The first and second editions also established themselves as the market leaders for courses in data mining, data analytics, and knowledge discovery. Revisions incorporate input from instructors, changes in the field, and new and important topics such as data warehouse and data cube technology, mining stream data, mining social networks, and mining spatial, multimedia and other complex data. The book begins with a conceptual introduction followed by comprehensive, state-of-the-art coverage of concepts and techniques. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. Wherever possible, the authors raise and answer questions of utility, feasibility, optimization, and scalability.
-- A comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
-- Updates that incorporate input from readers, changes in the field, and more material on statistics and machine learning.
-- Scores of algorithms and implementation examples, all in easily understood pseudo-code and suitable for use in real-world, large-scale data mining projects.
-- Complete classroom support for instructors as well as bonus content available at the companion website.

Data Mining: Concepts and Techniques

Book

Jan 2006

A Parameterized Approach to Spam-Resilient Link Analysis of the Web

Article

Nov 2009

Link-based analysis of the Web provides the basis for many important applications, such as Web search, Web-based data mining, and Web page categorization, that bring order to the massive amount of distributed Web content. Due to the overwhelming reliance on these important applications, there is a rise in efforts to manipulate (or spam) the link structure of the Web. In this manuscript, we present a parameterized framework for link analysis of the Web that promotes spam resilience through a source-centric view of the Web. We provide a rigorous study of the set of critical parameters that can impact source-centric link analysis and propose the novel notion of influence throttling for countering the influence of link-based manipulation. Through formal analysis and a large-scale experimental study, we show how different parameter settings may impact the time complexity, stability, and spam resilience of Web link analysis. Concretely, we find that the source-centric model supports more effective and robust rankings in comparison with existing Web algorithms such as PageRank.

Imperialist Competitive Algorithm: An Algorithm for Optimization Inspired by Imperialistic Competition

Conference Paper

Oct 2007

This paper proposes an optimization algorithm inspired by imperialistic competition. Like other evolutionary algorithms, it starts with an initial population. Individuals in the population, called countries, are of two types, colonies and imperialists, which together form empires. Imperialistic competition among these empires forms the basis of the proposed evolutionary algorithm. During this competition, weak empires collapse and powerful ones take possession of their colonies. Ideally, imperialistic competition converges to a state in which only one empire exists and its colonies are in the same position, and have the same cost, as the imperialist. Applying the proposed algorithm to some benchmark cost functions shows its ability to deal with different types of optimization problems.

Web Spam Detection: New Classification Features Based on Qualified Link Analysis and Language Models

Article

Oct 2010

Data Mining: Concepts and Techniques

Book

Jan 2000
