Spam Filtering with Naive Bayes – Which Naive Bayes?
Vangelis Metsis
Institute of Informatics and
Telecommunications,
N.C.S.R. “Demokritos”,
Athens, Greece
(Work carried out mostly while at the Department of Informatics, Athens University of Economics and Business.)
Ion Androutsopoulos
Department of Informatics,
Athens University of
Economics and Business,
Athens, Greece
Georgios Paliouras
Institute of Informatics and
Telecommunications,
N.C.S.R. “Demokritos”,
Athens, Greece

CEAS 2006 – Third Conference on Email and Anti-Spam, July 27–28, 2006, Mountain View, California, USA.

(This version of the paper contains some minor corrections in the description of Flexible Bayes, which were made after the conference.)
ABSTRACT
Naive Bayes is very popular in commercial and open-source
anti-spam e-mail filters. There are, however, several forms
of Naive Bayes, something the anti-spam literature does not
always acknowledge. We discuss five different versions of
Naive Bayes, and compare them on six new, non-encoded
datasets, that contain ham messages of particular Enron
users and fresh spam messages. The new datasets, which
we make publicly available, are more realistic than previous
comparable benchmarks, because they maintain the tempo-
ral order of the messages in the two categories, and they
emulate the varying proportion of spam and ham messages
that users receive over time. We adopt an experimental
procedure that emulates the incremental training of person-
alized spam filters, and we plot roc curves that allow us to
compare the different versions of nb over the entire tradeoff
between true positives and true negatives.
1. INTRODUCTION
Although several machine learning algorithms have been
employed in anti-spam e-mail filtering, including algorithms
that are considered top-performers in text classification, like
Boosting and Support Vector Machines (see, for example,
[4, 6, 10, 16]), Naive Bayes (nb) classifiers currently appear
to be particularly popular in commercial and open-source
spam filters. This is probably due to their simplicity, which
makes them easy to implement, their linear computational
complexity, and their accuracy, which in spam filtering is
comparable to that of more elaborate learning algorithms
[2]. There are, however, several forms of nb classifiers, and
the anti-spam literature does not always acknowledge this.
In their seminal papers on learning-based spam filters,
Sahami et al. [21] used a nb classifier with a multi-variate
Bernoulli model (this is also the model we had used in [1]), a
form of nb that relies on Boolean attributes, whereas Pantel
and Lin [19] in effect adopted the multinomial form of nb,
which normally takes into account term frequencies. Mc-
Callum and Nigam [17] have shown experimentally that the
multinomial nb performs generally better than the multi-
variate Bernoulli nb in text classification, a finding that
Schneider [24] and Hovold [12] verified with spam filter-
ing experiments on Ling-Spam and the pu corpora [1, 2,
23]. In further work on text classification, which included
experiments on Ling-Spam, Schneider [25] found that the
multinomial nb surprisingly performs even better when term
frequencies are replaced by Boolean attributes.
The multi-variate Bernoulli nb can be modified to accom-
modate continuous attributes, leading to what we call the
multi-variate Gauss nb, by assuming that the values of each
attribute follow a normal distribution within each category
[14]. Alternatively, the distribution of each attribute in each
category can be taken to be the average of several normal
distributions, one for every different value the attribute has
in the training data of that category, leading to a nb ver-
sion that John and Langley [14] call Flexible Bayes (fb).
In previous work [2], we found that fb clearly outperforms
the multi-variate Gauss nb on the pu corpora, when the at-
tributes are term frequencies divided by document lengths,
but we did not compare fb against the other nb versions.
In this paper we shed more light on the five versions of
nb mentioned above, and we evaluate them experimentally
on six new, non-encoded datasets, collectively called Enron-
Spam, which we make publicly available.¹ Each dataset con-
tains ham (non-spam) messages from a single user of the
Enron corpus [15], to which we have added fresh spam mes-
sages with varying ham-spam ratios. Although a similar
approach was adopted in the public benchmark of the trec
2005 Spam Track, to be discussed below, we believe that
our datasets are better suited to evaluations of personalized
filters, i.e., filters that are trained on incoming messages of
a particular user they are intended to protect, which is the
type of filters the experiments of this paper consider. Un-
like Ling-Spam and the pu corpora, in the new datasets we
maintain the order in which the original messages of the
two categories were received, and we emulate the varying
proportion of ham and spam messages that users receive
over time. This allows us to conduct more realistic exper-
iments, and to take into account the incremental training
of personal filters. Furthermore, rather than focussing on a
handful of relative misclassification costs (cost of false posi-
tives vs. false negatives; λ = 1, 9, 999 in our previous work),
we plot entire roc curves, which allow us to compare the
different versions of nb over the entire tradeoff between true
positives and true negatives.

¹ The Enron-Spam datasets are available from http://www.iit.demokritos.gr/skel/i-config/ and http://www.aueb.gr/users/ion/publications.html, in both raw and pre-processed form. Ling-Spam and the pu corpora are also available from the same addresses.
Note that several publicly available spam filters appear
to be using techniques described as “Bayesian”, but which
are very different from any form of nb discussed in the acad-
emic literature and any other technique that would normally
be called Bayesian therein.² Here we focus on nb versions
published in the academic literature, leaving comparisons
against other “Bayesian” techniques for future work.
Section 2 below presents the event models and assump-
tions of the nb versions we considered. Section 3 explains
how the datasets of our experiments were assembled and
the evaluation methodology we used; it also highlights some
pitfalls that have to be avoided when constructing spam fil-
tering benchmarks. Section 4 then presents and discusses
our experimental results. Section 5 concludes and provides
directions for further work.
2. NAIVE BAYES CLASSIFIERS
As a simplification, we focus on the textual content of
the messages. Operational filters would also consider infor-
mation such as the presence of suspicious headers or token
obfuscation [11, 21], which can be added as additional at-
tributes in the message representation discussed below. Al-
ternatively, separate classifiers can be trained for textual
and other attributes, and then form an ensemble [9, 22].
In our experiments, each message is ultimately represented
as a vector $\langle x_1, \ldots, x_m \rangle$, where $x_1, \ldots, x_m$ are the values of
attributes $X_1, \ldots, X_m$, and each attribute provides information
about a particular token of the message.³ In the
simplest case, all the attributes are Boolean: $X_i = 1$ if the
message contains the token; otherwise, $X_i = 0$. Alternatively,
their values may be term frequencies (tf), showing
how many times the corresponding token occurs in the message.⁴
Attributes with tf values carry more information
than Boolean ones. Hence, one might expect nb versions
that use tf attributes to perform better than those with
Boolean attributes, an expectation that is not always con-
firmed, as already mentioned. A third alternative we em-
ployed, hereafter called normalized tf, is to divide term
frequencies by the total number of token occurrences in the
message, to take into account the message’s length. The
motivation is that knowing, for example, that “rich” occurs
3 times in a message may be a good indication that the mes-
sage is spam if it is only two paragraphs long, but not if the
message is much longer.
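To make the three representations concrete, here is a minimal sketch (our own illustration, not code from the paper) that derives Boolean, tf, and normalized tf values from a tokenized message, given the list of tokens retained as attributes:

from collections import Counter

def attribute_values(tokens, attribute_tokens):
    """Compute Boolean, TF, and normalized TF values for the given attributes.

    tokens: the tokens of one message (punctuation kept as separate tokens);
    attribute_tokens: the m tokens retained after attribute selection.
    """
    counts = Counter(tokens)
    length = len(tokens) or 1  # total token occurrences in the message
    boolean = [1 if counts[t] > 0 else 0 for t in attribute_tokens]
    tf = [counts[t] for t in attribute_tokens]
    norm_tf = [counts[t] / length for t in attribute_tokens]
    return boolean, tf, norm_tf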
Following common text classification practice, we do not
assign attributes to tokens that are too rare (we discard
tokens that do not occur in at least 5 messages of the train-
ing data). We also rank the remaining attributes by in-
formation gain, and use only the $m$ best, as in [1, 2, 21]
and elsewhere. We experimented with $m = 500$, 1000, and
3000. Note that the information gain ranking treats the at-
tributes as Boolean, which may not be entirely satisfactory
when employing a nb version with non-Boolean attributes.
Schneider [24] experimented with alternative versions of the
information gain measure, intended to be more suitable to
the tf-valued attributes of the multinomial nb. His results,
however, indicate that the alternative versions do not lead
to higher accuracy, although sometimes they allow the same
level of accuracy to be reached with fewer attributes.

² These techniques derive mostly from P. Graham's "A plan for spam"; see http://www.paulgraham.com/spam.html.
³ Attributes may also be mapped to character or token n-grams, but previous attempts to use n-grams in spam filtering led to contradictory or inconclusive results [2, 12, 19].
⁴ We treat punctuation and other non-alphabetic characters as separate tokens. Many of these are highly informative as attributes, because they are more common in spam messages (especially obfuscated ones) than ham messages; see [2].
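As an illustration of the selection step above, the following sketch (our own; it treats attributes as Boolean, as the ranking described above does) computes the information gain of a candidate token:

import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(messages, labels, token):
    """Information gain of the Boolean attribute 'token occurs in message'.

    messages: a list of token sets; labels: a parallel list of 'spam'/'ham'.
    """
    n = len(messages)
    h_c = entropy([c / n for c in Counter(labels).values()])
    with_t = [l for m, l in zip(messages, labels) if token in m]
    without_t = [l for m, l in zip(messages, labels) if token not in m]
    h_c_given_x = 0.0
    for subset in (with_t, without_t):
        if subset:
            h_c_given_x += (len(subset) / n) * entropy(
                [c / len(subset) for c in Counter(subset).values()])
    return h_c - h_c_given_x

Ranking all candidate tokens by this score and keeping the $m$ best reproduces the selection step described above.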
From Bayes' theorem, the probability that a message with
vector $\vec{x} = \langle x_1, \ldots, x_m \rangle$ belongs in category $c$ is:

$$p(c \mid \vec{x}) = \frac{p(c) \cdot p(\vec{x} \mid c)}{p(\vec{x})}.$$

Since the denominator does not depend on the category,
nb classifies each message in the category that maximizes
$p(c) \cdot p(\vec{x} \mid c)$. In the case of spam filtering, this is equivalent
to classifying a message as spam whenever:

$$\frac{p(c_s) \cdot p(\vec{x} \mid c_s)}{p(c_s) \cdot p(\vec{x} \mid c_s) + p(c_h) \cdot p(\vec{x} \mid c_h)} > T,$$

with $T = 0.5$, where $c_h$ and $c_s$ denote the ham and spam categories.
By varying $T$, one can opt for more true negatives
(correctly classified ham messages) at the expense of fewer
true positives (correctly classified spam messages), or vice versa.
The a priori probabilities $p(c)$ are typically estimated
by dividing the number of training messages of category $c$
by the total number of training messages. The probabilities
$p(\vec{x} \mid c)$ are estimated differently in each nb version.
2.1 Multi-variate Bernoulli NB
Let us denote by $F = \{t_1, \ldots, t_m\}$ the set of tokens that
correspond to the $m$ attributes after attribute selection. The
multi-variate Bernoulli nb treats each message $d$ as a set
of tokens, containing (only once) each $t_i$ that occurs in
$d$. Hence, $d$ can be represented by a binary vector $\vec{x} = \langle x_1, \ldots, x_m \rangle$,
where each $x_i$ shows whether or not $t_i$ occurs
in $d$. Furthermore, each message $d$ of category $c$ is
seen as the result of $m$ Bernoulli trials, where at each trial
we decide whether or not $t_i$ will occur in $d$. The probability
of a positive outcome at trial $i$ ($t_i$ occurs in $d$) is
$p(t_i \mid c)$. The multi-variate Bernoulli nb makes the additional
assumption that the outcomes of the trials are independent
given the category. This is a "naive" assumption,
since word co-occurrences in a category are not independent.
Similar assumptions are made in all nb versions, and
although in most cases they are over-simplistic, they still
lead to very good performance in many classification tasks;
see, for example, [5] for a theoretical explanation. Then,
$p(\vec{x} \mid c)$ can be computed as:

$$p(\vec{x} \mid c) = \prod_{i=1}^{m} p(t_i \mid c)^{x_i} \cdot (1 - p(t_i \mid c))^{(1 - x_i)},$$

and the criterion for classifying a message as spam becomes:

$$\frac{p(c_s) \cdot \prod_{i=1}^{m} p(t_i \mid c_s)^{x_i} \cdot (1 - p(t_i \mid c_s))^{(1 - x_i)}}{\sum_{c \in \{c_s, c_h\}} p(c) \cdot \prod_{i=1}^{m} p(t_i \mid c)^{x_i} \cdot (1 - p(t_i \mid c))^{(1 - x_i)}} > T,$$

where each $p(t \mid c)$ is estimated using a Laplacean prior as:

$$p(t \mid c) = \frac{1 + M_{t,c}}{2 + M_c},$$

and $M_{t,c}$ is the number of training messages of category $c$
that contain token $t$, while $M_c$ is the total number of training
messages of category $c$.
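A minimal sketch of this version follows (our own illustration; the class and method names are ours). It estimates the Laplacean priors above and returns $p(c_s \mid \vec{x})$, so that the threshold $T$ can be applied:

import math

class MultiVariateBernoulliNB:
    """Multi-variate Bernoulli NB with Laplacean priors (Section 2.1 sketch)."""

    def fit(self, vectors, labels):
        # vectors: Boolean vectors over the m selected tokens
        self.categories = sorted(set(labels))
        self.log_prior, self.p_t = {}, {}
        m = len(vectors[0])
        for c in self.categories:
            docs = [v for v, l in zip(vectors, labels) if l == c]
            m_c = len(docs)
            self.log_prior[c] = math.log(m_c / len(vectors))
            # p(t|c) = (1 + M_{t,c}) / (2 + M_c)
            self.p_t[c] = [(1 + sum(v[i] for v in docs)) / (2 + m_c)
                           for i in range(m)]
        return self

    def posterior_spam(self, x, spam='spam'):
        # log p(c) plus the log of the Bernoulli likelihood of each attribute
        log_joint = {}
        for c in self.categories:
            s = self.log_prior[c]
            for x_i, p in zip(x, self.p_t[c]):
                s += math.log(p if x_i else 1.0 - p)
            log_joint[c] = s
        # normalize in log space to avoid numerical underflow
        mx = max(log_joint.values())
        z = sum(math.exp(v - mx) for v in log_joint.values())
        return math.exp(log_joint[spam] - mx) / z

A message is then classified as spam whenever posterior_spam(x) > T, with T = 0.5 as the default.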
2.2 Multinomial NB, TF attributes
The multinomial nb with tf attributes treats each message
$d$ as a bag of tokens, containing each $t_i$ as many
times as it occurs in $d$. Hence, $d$ can be represented by a
vector $\vec{x} = \langle x_1, \ldots, x_m \rangle$, where each $x_i$ is now the number
of occurrences of $t_i$ in $d$. Furthermore, each message $d$ of
category $c$ is seen as the result of picking independently $|d|$
tokens from $F$ with replacement, with probability $p(t_i \mid c)$
for each $t_i$.⁵ Then, $p(\vec{x} \mid c)$ is the multinomial distribution:

$$p(\vec{x} \mid c) = p(|d|) \cdot |d|! \cdot \prod_{i=1}^{m} \frac{p(t_i \mid c)^{x_i}}{x_i!},$$

where we have followed the common assumption [17, 24,
25] that $|d|$ does not depend on the category $c$. This is an
additional over-simplistic assumption, which is more questionable
in spam filtering. For example, the probability of
receiving a very long spam message appears to be smaller
than that of receiving an equally long ham message.

The criterion for classifying a message as spam becomes:

$$\frac{p(c_s) \cdot \prod_{i=1}^{m} p(t_i \mid c_s)^{x_i}}{\sum_{c \in \{c_s, c_h\}} p(c) \cdot \prod_{i=1}^{m} p(t_i \mid c)^{x_i}} > T,$$

where each $p(t \mid c)$ is estimated using a Laplacean prior as:

$$p(t \mid c) = \frac{1 + N_{t,c}}{m + N_c},$$

and $N_{t,c}$ is now the number of occurrences of token $t$ in the
training messages of category $c$, while $N_c = \sum_{i=1}^{m} N_{t_i,c}$.

⁵ In effect, this is a unigram language model. Additional variants of the multinomial nb can be formed by using n-gram language models instead [20]. See also [13] for other improvements that can be made to the multinomial nb.
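The corresponding sketch for this version (again our own illustration) drops the length-dependent factors, which are category-independent and cancel in the spam criterion above:

import math

class MultinomialNB:
    """Multinomial NB with TF attributes and Laplacean priors (Section 2.2 sketch)."""

    def fit(self, vectors, labels):
        # vectors: term-frequency vectors over the m selected tokens
        self.categories = sorted(set(labels))
        self.log_prior, self.log_p_t = {}, {}
        m = len(vectors[0])
        for c in self.categories:
            docs = [v for v, l in zip(vectors, labels) if l == c]
            self.log_prior[c] = math.log(len(docs) / len(vectors))
            n_tc = [sum(v[i] for v in docs) for i in range(m)]  # N_{t,c}
            n_c = sum(n_tc)                                     # N_c
            # p(t|c) = (1 + N_{t,c}) / (m + N_c)
            self.log_p_t[c] = [math.log((1 + n) / (m + n_c)) for n in n_tc]
        return self

    def posterior_spam(self, x, spam='spam'):
        # p(|d|) * |d|! / prod(x_i!) is the same for both categories and cancels
        log_joint = {c: self.log_prior[c] +
                        sum(x_i * lp for x_i, lp in zip(x, self.log_p_t[c]))
                     for c in self.categories}
        mx = max(log_joint.values())
        z = sum(math.exp(v - mx) for v in log_joint.values())
        return math.exp(log_joint[spam] - mx) / z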
2.3 Multinomial NB, Boolean attributes
The multinomial nb with Boolean attributes is the same
as with tf attributes, including the estimates of $p(t \mid c)$,
except that the attributes are now Boolean. It differs from
the multi-variate Bernoulli nb in that it does not take into
account directly the absence ($x_i = 0$) of tokens from the
message (there is no $(1 - p(t_i \mid c))^{(1 - x_i)}$ factor), and it estimates
the $p(t \mid c)$ with a different Laplacean prior.
It may seem strange that the multinomial nb might per-
form better with Boolean attributes, which provide less in-
formation than tf ones. As Schneider [25] points out, how-
ever, it has been proven [7] that the multinomial nb with
tf attributes is equivalent to a nb version with attributes
modelled as following Poisson distributions in each category,
assuming that the document length is independent of the
category. Hence, the multinomial nb may perform better
with Boolean attributes, if tf attributes in reality do not
follow Poisson distributions.
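In implementation terms, the change is only in the input vectors; a sketch reusing the multinomial classifier above:

def to_boolean(tf_vector):
    """Truncate TF values to presence/absence; the estimator is unchanged."""
    return [1 if x_i > 0 else 0 for x_i in tf_vector]

# hypothetical usage, reusing the Section 2.2 sketch:
# model = MultinomialNB().fit([to_boolean(v) for v in train_vectors], labels)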
2.4 Multi-variate Gauss NB
The multi-variate Bernoulli nb can be modified for real-valued
attributes, by assuming that each attribute follows a
normal distribution $g(x_i; \mu_{i,c}, \sigma_{i,c})$ in each category $c$, where:

$$g(x_i; \mu_{i,c}, \sigma_{i,c}) = \frac{1}{\sigma_{i,c} \sqrt{2\pi}} \, e^{-\frac{(x_i - \mu_{i,c})^2}{2 \sigma_{i,c}^2}},$$

and the mean ($\mu_{i,c}$) and standard deviation ($\sigma_{i,c}$) of each distribution
are estimated from the training data. Then, assuming
again that the values of the attributes are independent
given the category, we get:

$$p(\vec{x} \mid c) = \prod_{i=1}^{m} g(x_i; \mu_{i,c}, \sigma_{i,c}),$$

and the criterion for classifying a message as spam becomes:

$$\frac{p(c_s) \cdot \prod_{i=1}^{m} g(x_i; \mu_{i,c_s}, \sigma_{i,c_s})}{\sum_{c \in \{c_s, c_h\}} p(c) \cdot \prod_{i=1}^{m} g(x_i; \mu_{i,c}, \sigma_{i,c})} > T.$$
This allows us to use normalized tf attributes, whose val-
ues are (non-negative) reals, unlike the tf attributes of the
multinomial nb. Real-valued attributes, however, may not
follow normal distributions. With our normalized tf at-
tributes, there is also the problem that negative values are
not used, which leads to a significant loss of probability mass
in the (presumed) normal distributions of attributes whose
variances are large and means are close to zero.
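A sketch of this version follows (our own illustration; the variance floor that guards against attributes that are constant within a category is our addition, not part of the paper):

import math

def gaussian(x, mu, sigma):
    sigma = max(sigma, 1e-9)  # floor to avoid division by zero
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

class MultiVariateGaussNB:
    """Multi-variate Gauss NB (Section 2.4 sketch): one normal per attribute and category."""

    def fit(self, vectors, labels):
        # vectors: normalized-TF (real-valued) vectors over the m selected tokens
        self.categories = sorted(set(labels))
        self.log_prior, self.params = {}, {}
        m = len(vectors[0])
        for c in self.categories:
            docs = [v for v, l in zip(vectors, labels) if l == c]
            self.log_prior[c] = math.log(len(docs) / len(vectors))
            params = []
            for i in range(m):
                vals = [v[i] for v in docs]
                mu = sum(vals) / len(vals)
                var = sum((x - mu) ** 2 for x in vals) / len(vals)
                params.append((mu, math.sqrt(var)))
            self.params[c] = params
        return self

    def posterior_spam(self, x, spam='spam'):
        log_joint = {}
        for c in self.categories:
            s = self.log_prior[c]
            for x_i, (mu, sigma) in zip(x, self.params[c]):
                s += math.log(max(gaussian(x_i, mu, sigma), 1e-300))
            log_joint[c] = s
        mx = max(log_joint.values())
        z = sum(math.exp(v - mx) for v in log_joint.values())
        return math.exp(log_joint[spam] - mx) / z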
2.5 Flexible Bayes
Instead of using a single normal distribution for each attribute
per category, fb models $p(x_i \mid c)$ as the average of
$L_{i,c}$ normal distributions with different mean values, but the
same standard deviation:

$$p(x_i \mid c) = \frac{1}{L_{i,c}} \cdot \sum_{l=1}^{L_{i,c}} g(x_i; \mu_{i,c,l}, \sigma_c),$$

where $L_{i,c}$ is the number of different values $X_i$ has in the
training data of category $c$. Each of these values is used as
the mean $\mu_{i,c,l}$ of a normal distribution of that category. All
the distributions of a category $c$ are taken to have the same
standard deviation, estimated as $\sigma_c = \frac{1}{\sqrt{M_c}}$, where $M_c$ is again
the number of training messages in $c$. Hence, the distributions
of each category become narrower as more training
messages of that category are accumulated; in the case of our
normalized tf attributes, this also alleviates the problem of
probability mass loss of the multi-variate Gauss nb. By
averaging several normal distributions, fb can approximate
the true distributions of real-valued attributes more closely
than the multi-variate Gauss nb, when the assumption that
the attributes follow normal distributions is violated.
The computational complexity of all five nb versions is
$O(m \cdot N)$ during training, where $N$ is the total number of
training messages. At classification time, the computational
complexity of the first four versions is $O(m)$, while the complexity
of fb is $O(m \cdot N)$, because of the need to sum the
$L_{i,c}$ distributions. Consult [2] for further details.
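For completeness, a sketch of the fb density estimate (reusing the gaussian helper from the Section 2.4 sketch; the function name is ours):

import math

def fb_density(x_i, training_values, m_c):
    """Flexible Bayes estimate of p(x_i | c) (Section 2.5 sketch).

    training_values: the L_{i,c} distinct values attribute X_i takes in the
    training messages of category c; m_c: number of training messages of c.
    All kernels share the standard deviation sigma_c = 1 / sqrt(M_c).
    """
    sigma_c = 1.0 / math.sqrt(m_c)
    return sum(gaussian(x_i, mu, sigma_c)
               for mu in training_values) / len(training_values)

The $O(m \cdot N)$ classification cost is visible here: each density evaluation sums over up to one kernel per training message of the category.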
3. DATASETS AND METHODOLOGY
There has been significant effort to generate public bench-
mark datasets for anti-spam filtering. One of the main con-
cerns is how to protect the privacy of the users (senders and
receivers) whose ham messages are included in the datasets.
The first approach is to use ham messages collected from
freely accessible newsgroups, or mailing lists with public
archives. Ling-Spam, the earliest of our benchmark datasets,
follows this approach [23]. It consists of spam messages re-
ceived at the time and ham messages retrieved from the
archives of the Linguist list, a moderated and, hence, spam-
free list about linguistics. Ling-Spam has the disadvan-
tage that its ham messages are more topic-specific than the
messages most users receive. Hence, it can lead to over-
optimistic estimates of the performance of learning-based
spam filters. The SpamAssassin corpus is similar, in that
its ham messages are publicly available; they were collected
from public fora, or they were donated by users with the un-
derstanding they may be made public.⁶ Since they were re-
ceived by different users, however, SpamAssassin’s ham mes-
sages are less topic-specific than those a single user would
receive. Hence, the resulting dataset is inappropriate for
experimentation with personalized spam filters.
An alternative solution to the privacy problem is to dis-
tribute information about each message (e.g., the frequen-
cies of particular words in each message), rather than the
messages themselves. The Spambase collection follows this
approach. It consists of vectors, each representing a single
message (spam or ham), with each vector containing the
values of pre-selected attributes, mostly word frequencies.
The same approach was adopted in a corpus developed for a
recently announced ecml-pkdd 2006 challenge.⁷ Datasets
that adopt this approach, however, are much more restric-
tive than Ling-Spam and the SpamAssassin corpus, because
their messages are not available in raw form, and, hence, it
is impossible to experiment with attributes other than those
chosen by their creators.
A third approach is to release benchmarks each consist-
ing of messages received by a particular user, after replacing
each token by a unique number in all the messages. The
mapping between tokens and numbers is not released, mak-
ing it extremely difficult to recover the original messages,
other than perhaps common words and phrases therein. This
bypasses privacy problems, while producing messages whose
token sequences are very close, from a statistical point of
view, to the original ones. We have used this encoding
scheme in the pu corpora [1, 2, 23]. However, the loss of
the original tokens still imposes restrictions; for example, it
is impossible to experiment with different tokenizers.
Following the Enron investigation, the personal files of ap-
proximately 150 Enron employees were made publicly avail-
able.⁸ The files included a large number of personal e-mail
messages, which have been used to create e-mail classifi-
cation benchmarks [3, 15], including a public benchmark
corpus for the trec 2005 Spam Track.⁹ During the con-
struction of the latter benchmark, several spam filters were
employed to weed spam out of the Enron message collection.
The collection was then augmented with spam messages col-
lected in 2005, leading to a benchmark with 43,000 ham and
approximately 50,000 spam messages. The 2005 Spam Track
experiments did not separate the resulting corpus into per-
sonal mailboxes, although such a division might have been
possible via the ‘To:’ field. Hence, the experiments corre-
sponded to the scenario where a single filter is trained on a
collection of messages received by many different users, as
opposed to using personalized filters.
As we were more interested in personalized spam filters,
we focussed on six Enron employees who had large mail-
boxes. More specifically, we used the mailboxes of employees
farmer-d, kaminski-v, kitchen-l, williams-w3, beck-s,
and lokay-m, in the cleaned-up form provided by Bekkerman
[3], which includes only ham messages.¹⁰ We also used
spam messages obtained from four different sources: (1) the
SpamAssassin corpus, (2) the Honeypot project,¹¹ (3) the
spam collection of Bruce Guenter (bg),¹² and spam collected
by the third author of this paper (gp).

⁶ The SpamAssassin corpus and Spambase are available from http://www.spamassassin.org/publiccorpus/ and http://www.ics.uci.edu/mlearn/MLRepository.html.
⁷ See http://www.ecmlpkdd2006.org/challenge.html.
⁸ See http://fercic.aspensys.com/members/manager.asp.
⁹ Consult http://plg.uwaterloo.ca/~gvcormac/spam/ for further details. We do not discuss the other three corpora of the 2005 Spam Track, as they are not publicly available.
¹⁰ The mailboxes can be downloaded from http://www.cs.umass.edu/~ronb/datasets/enron_flat.tar.gz.
¹¹ Consult http://www.projecthoneypot.org/.
¹² See http://untroubled.org/spam/.
The first three spam sources above collect spam via traps
(e.g., e-mail addresses published on the Web in a way that
makes it clear to humans, but not to crawlers, that they
should not be used), resulting in multiple copies of the same
messages. We applied a heuristic to the spam collection we
obtained from each one of the first three spam sources, to
identify and remove multiple copies; the heuristic is based
on the number of common text lines in each pair of spam
messages. After removing duplicates, we merged the spam
collections obtained from sources 1 and 2, because the mes-
sages from source 1 were too few to be used on their own
and did not include recent spam, whereas the messages from
source 2 were fresher, but they covered a much shorter pe-
riod of time. The resulting collection (dubbed sh; SpamAs-
sassin spam plus Honeypot) contains messages sent between
May 2001 and July 2005. From the third spam source (bg)
we kept messages sent between August 2004 and July 2005,
a period ending close to the time our datasets were con-
structed. Finally, the fourth spam source is the only one
that does not rely on traps. It contains all the spam mes-
sages received by gp between December 2003 and September
2005; duplicates were not removed in this case, as they are
part of a normal stream of incoming spam.
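The paper does not spell out the exact de-duplication heuristic; a minimal sketch in the same spirit, which flags a pair as copies when they share most of their text lines (the 0.8 cutoff is our assumption, not a value reported in the paper), is:

def near_duplicates(msg_a, msg_b, threshold=0.8):
    """Flag two messages as copies if most of their text lines coincide.

    msg_a, msg_b: message bodies as strings; the threshold is an assumed
    cutoff, not a value reported in the paper.
    """
    lines_a = {l.strip() for l in msg_a.splitlines() if l.strip()}
    lines_b = {l.strip() for l in msg_b.splitlines() if l.strip()}
    if not lines_a or not lines_b:
        return False
    common = len(lines_a & lines_b)
    return common / min(len(lines_a), len(lines_b)) >= threshold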
The six ham message collections (six Enron users) were
each paired with one of the three spam collections (sh, bg,
gp). Since the vast majority of spam messages are not per-
sonalized, we believe that mixing ham messages received
by one user with spam messages received by others leads
to reasonable benchmarks, provided that additional steps
are taken, as discussed below. The same approach can be
used in future to replace the spam messages of our datasets
with fresher ones. We also varied the ham-spam ratios, by
randomly subsampling the spam or ham messages, where
necessary. In three of the resulting benchmark datasets, we
used a ham-spam ratio of approximately 3:1, while in the
other three we inverted the ratio to 1:3. The total number
of messages in each dataset is between five and six thousand.
The six datasets emulate different situations faced by real
users, allowing us to obtain a more complete picture of the
performance of learning-based filters. Table 1 summarizes
the characteristics of the six datasets. Hereafter, we refer
to the first, second, . . ., sixth dataset of Table 1 as Enron1,
Enron2, . . . , Enron6, respectively.
In addition to what was mentioned above, the six datasets
were subjected to the following pre-processing steps. First,
we removed messages sent by the owner of the mailbox (we
checked if the address of the owner appeared in the ‘To:’,
‘Cc:’, or ‘Bcc:’ fields), since we believe e-mail users are in-
creasingly adopting better ways to keep copies of outgoing
messages. Second, as a simplification, we removed all html
tags and the headers of the messages, keeping only their
subjects and bodies. In operational filters, html tags and
headers can provide additional useful attributes, as mentioned
above; hence, our datasets lead to conservative estimates
of the performance of operational filters. Third, we
removed spam messages written in non-Latin character sets,
because the ham messages of our datasets are all written in
Latin characters, and, therefore, non-Latin spam messages
would be too easy to identify; i.e., we opted again for harder
datasets, that lead to conservative performance estimates.

Table 1: Composition of the six benchmark datasets.

ham + spam        ham:spam    ham period, spam period
farmer-d + gp     3672:1500   [12/99, 1/02], [12/03, 9/05]
kaminski-v + sh   4361:1496   [12/99, 5/01], [5/01, 7/05]
kitchen-l + bg    4012:1500   [2/01, 2/02], [8/04, 7/05]
williams-w3 + gp  1500:4500   [4/01, 2/02], [12/03, 9/05]
beck-s + sh       1500:3675   [1/00, 5/01], [5/01, 7/05]
lokay-m + bg      1500:4500   [6/00, 3/02], [8/04, 7/05]
One of the main goals of our evaluation was to emulate
the situation that a new user of a personalized learning-
based anti-spam filter faces: the user starts with a small
number of training messages, and retrains the filter as new
messages arrive. As noted in [8], this incremental retraining
and evaluation differs significantly from the cross-validation
experiments that are commonly used to measure the perfor-
mance of learning algorithms, and which have been adopted
in many previous spam filtering experiments, including our
own [2]. There are several reasons for this, including the
varying size of the training set, the increasingly more so-
phisticated tricks used by spam senders over time, the vary-
ing proportion of spam to ham messages in different time
periods, which makes the estimation of priors difficult, and
the topic shift of spam messages over time. Hence, an incre-
mental retraining and evaluation procedure that also takes
into account the characteristics of spam that vary over time
is essential when comparing different learning algorithms in
spam filtering. In order to realize this incremental proce-
dure with the use of our six datasets, we needed to order the
messages of each dataset in a way that preserves the original
order of arrival of the messages in each category; i.e., each
spam message must be preceded by all spam messages that
arrived earlier, and the same applies to ham messages. The
varying ham-spam ratio over time also had to be emulated.
(The reader is reminded that the spam and ham messages
of each dataset are from different time periods. Hence, one
cannot simply use the dates of the messages.) This was
achieved by using the following algorithm in each dataset:
1. Let $S$ and $H$ be the sets of spam and ham messages of the dataset, respectively.

2. Order the messages of $H$ by time of arrival.

3. Insert $|S|$ spam slots between the ordered messages of $H$ by $|S|$ independent random draws from $\{1, \ldots, |H|\}$ with replacement. If the outcome of a draw is $i$, a new spam slot is inserted after the $i$-th ham message. A ham message may thus be followed by several slots.

4. Fill the spam slots with the messages of $S$, by iteratively filling the earliest empty spam slot with the oldest message of $S$ that has not been placed to a slot.
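A direct transcription of the four steps (our own sketch; the seed parameter is ours, added for reproducibility):

import random

def interleave(ham_ordered, spam_ordered, seed=0):
    """Mix time-ordered ham and spam streams as in steps 1-4 above.

    Draws |S| slot positions uniformly from {1, ..., |H|} with replacement,
    then fills the slots, earliest first, with the oldest unplaced spam.
    """
    rng = random.Random(seed)
    slots = sorted(rng.randrange(1, len(ham_ordered) + 1)
                   for _ in spam_ordered)  # slot value i: after the i-th ham
    mixed, spam_iter, slot_idx = [], iter(spam_ordered), 0
    for i, ham in enumerate(ham_ordered, start=1):
        mixed.append(ham)
        while slot_idx < len(slots) and slots[slot_idx] == i:
            mixed.append(next(spam_iter))
            slot_idx += 1
    return mixed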
The actual dates of the messages are then discarded, and
we assume that the messages (ham and spam) of each dataset
[Figure 1: Fluctuation of the ham-spam ratio (Enron1). Vertical axis: ham:spam ratio per batch; horizontal axis: batch number.]
arrive in the order produced by the algorithm above. Fig-
ure 1 shows the resulting fluctuation of the ham-spam ratio
over batches of 100 adjacent messages each. The first batch
contains the “oldest” 100 messages, the second one the 100
messages that “arrived” immediately after those of the first
batch, etc. The ham-spam ratio in the entire dataset is 2.45.
In each ordered dataset, the incremental retraining and
evaluation procedure was implemented as follows:
1. Split the sequence of messages into batches $b_1, \ldots, b_l$ of $k$ adjacent messages each, preserving the order of arrival. Batch $b_l$ may have fewer than $k$ messages.

2. For $i = 1$ to $l - 1$, train the filter (including attribute selection) on the messages of batches $b_1, \ldots, b_i$, and test it on the messages of $b_{i+1}$.
Note that at the end of the evaluation, each message of
the dataset (excluding $b_1$) will have been classified exactly
once. The number of true positives ($TP$) is the number of
spam messages that have been classified as spam, and similarly
for false positives ($FP$, ham misclassified as spam),
true negatives ($TN$, correctly classified ham), and false negatives
($FN$, spam misclassified as ham). We set $k = 100$,
which emulates the situation where the filter is retrained
every 100 new messages.¹³ We assume that the user marks
as false negatives spam messages that pass the filter, and in-
spects periodically for false positives a “spam” folder, where
messages identified by the filter as spam end up.
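Putting the two steps and the four counts together, an evaluation-loop sketch might look as follows (our own illustration; make_filter is a hypothetical factory returning a classifier with the fit/posterior_spam interface of the Section 2 sketches, and attribute selection is assumed to happen inside fit, as the procedure requires):

def incremental_evaluation(vectors, labels, make_filter, k=100, threshold=0.5):
    """Train on batches b_1..b_i, test on b_{i+1}; accumulate TP, FP, TN, FN."""
    tp = fp = tn = fn = 0
    for start in range(k, len(vectors), k):
        # retrain from scratch on everything seen so far
        clf = make_filter().fit(vectors[:start], labels[:start])
        for x, y in zip(vectors[start:start + k], labels[start:start + k]):
            predicted_spam = clf.posterior_spam(x) > threshold
            if y == 'spam':
                tp, fn = tp + predicted_spam, fn + (not predicted_spam)
            else:
                fp, tn = fp + predicted_spam, tn + (not predicted_spam)
    return tp, fp, tn, fn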
In our evaluation, we used spam recall ($\frac{TP}{TP + FN}$) and ham
recall ($\frac{TN}{TN + FP}$). Spam recall is the proportion of spam messages
that the filter managed to identify correctly (how much
spam it blocked), whereas ham recall is the proportion of
ham messages that passed the filter. Spam recall is the complement
of the spam misclassification rate, and ham recall the
complement of the ham misclassification rate, the two measures
that were used in the trec 2005 Spam Track. In order to
evaluate the different nb versions across the entire tradeoff
between true positives and true negatives, we present the
evaluation results by means of roc curves, plotting sensitivity
(spam recall) against $1 - \text{specificity}$ (the complement
of ham recall, or ham misclassification rate). This is the
normal definition of roc analysis, when treating spam as
the positive and ham as the negative class.

¹³ An nb-based filter can easily be retrained on-line, immediately after receiving each new message. We chose $k = 100$ to make it easier to add in future work additional experiments with other learning algorithms, such as svms, which are computationally more expensive to train.

Table 2: Maximum difference (×100) in spam recall across 500, 1000, 3000 attributes for $T = 0.5$.

nb version     Enr1  Enr2  Enr3  Enr4  Enr5  Enr6
fb             7.87  3.46  1.43  1.31  0.11  0.34
mv Gauss       5.56  4.75  1.97  12.7  3.36  5.27
mn tf          0.88  0.95  0.20  0.50  0.75  0.18
mv Bernoulli   2.10  0.95  1.09  0.45  1.14  0.88
mn Boolean     2.31  1.97  2.04  0.43  0.39  0.20

Table 3: Maximum difference (×100) in ham recall across 500, 1000, 3000 attributes for $T = 0.5$.

nb version     Enr1  Enr2  Enr3  Enr4  Enr5  Enr6
fb             0.61  0.23  1.72  0.54  0.48  0.34
mv Gauss       1.17  0.75  5.94  1.77  5.91  4.88
mn tf          2.17  1.38  1.02  0.61  1.70  1.22
mv Bernoulli   1.47  0.63  6.37  2.04  2.11  1.22
mn Boolean     0.53  0.68  0.10  0.48  1.36  2.17
The roc curves capture the overall performance of the
different nb versions in each dataset, but fail to provide
a picture of the progress made by each nb version during
the incremental procedure. For this reason, we additionally
examine the learning curves of the five methods in terms of
the two measures for $T = 0.5$, i.e., we plot spam and ham
recall as the training set increases during the incremental
retraining and evaluation procedure.
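Given the per-message scores $p(c_s \mid \vec{x})$ produced during the incremental run, the roc points can be computed by sweeping $T$ (our own sketch):

def roc_points(scores, labels, thresholds):
    """Return (1 - specificity, sensitivity) pairs for each threshold T.

    scores: p(spam | x) per classified message; labels: 'spam' or 'ham'.
    """
    n_spam = labels.count('spam')
    n_ham = labels.count('ham')
    points = []
    for t in sorted(thresholds):
        tp = sum(1 for s, y in zip(scores, labels) if y == 'spam' and s > t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 'ham' and s > t)
        points.append((fp / n_ham, tp / n_spam))  # (1 - ham recall, spam recall)
    return points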
4. EXPERIMENTAL RESULTS
4.1 Size of attribute set
We first examined the impact of the number of attributes
on the effectiveness of the five nb versions.¹⁴ As mentioned
above, we experimented with 500, 1000, and 3000 attributes.
The full results of these experiments (not reported here) in-
dicate that overall the best results are achieved with 3000
attributes, as one might have expected. The differences in
effectiveness across different numbers of attributes, however,
are rather insignificant. As an example, Tables 2 and 3 show
the maximum differences in spam and ham recall, respec-
tively, across the three sizes of the attribute set, for each nb
version and dataset, with $T = 0.5$; note that the differences
are in percentage points. The tables show that the differ-
ences are very small in all five nb versions for this threshold
value, and we obtained very similar results for all thresholds.
Consequently, in operational filters the differences in effec-
tiveness may not justify the increased computational cost
that larger attribute sets require, even though the increase
in computational cost is linear in the number of attributes.
4.2 Comparison of NB versions
Figure 2 shows the roc curves of the five nb versions in
each one of the six datasets.¹⁵ All the curves are for 3000
attributes, and the error bars correspond to 0.95 confidence
intervals; we show error bars only at some points to avoid
cluttering the diagrams. Since the tolerance of most users
on misclassifying ham messages is very limited, we have restricted
the horizontal axis ($1 - \text{specificity} = 1 - \text{ham recall}$)
of all diagrams to $[0, 0.2]$, i.e., a maximum of 20% of misclassified
ham, in order to improve the readability of the
diagrams. On the vertical axis (sensitivity, spam recall) we
show the full range, which allows us to examine what proportion
of spam messages the five nb versions manage to block
when requesting a very low ham misclassification rate (when
$1 - \text{specificity}$ approaches 0). The optimal performance point
in an roc diagram is the top-left corner, while the area under
each curve (auc) is often seen as a summary of the
performance of the corresponding method. We do not, however,
believe that standard auc is a good measure for spam
filters, because it is dominated by non-high specificity (ham
recall) regions, which are of no interest in practice. Perhaps
one should compute the area for $1 - \text{specificity} \in [0, 0.2]$
or $[0, 0.1]$. Even then, however, it is debatable how the area
should be computed when roc curves do not span the entire
$[0, 0.2]$ or $[0, 0.1]$ range of the horizontal axis (see below).

¹⁴ We used a modified version of filtron [18] for our experiments, with weka's implementations of the five nb versions; see http://www.cs.waikato.ac.nz/ml/weka/.
¹⁵ Please view the figures in color, consulting the on-line version of this paper if necessary; see http://www.ceas.cc/.
A first conclusion that can be drawn from the results of
Figure 2 is that some datasets, such as Enron4, are “easier”
than others, such as Enron1. There does not seem to be a
clear justification for these differences, in terms of the ham-
spam ratio or the spam source used in each dataset.
Despite its theoretical association with term frequencies, in
all six datasets the multinomial nb seems to be doing better
when Boolean attributes are used, which agrees with Schnei-
der’s observations [25]. The difference, however, is in most
cases very small and not always statistically significant; it
is clearer in the first dataset and, to a lesser extent, in the
last one. Furthermore, the multinomial nb with Boolean at-
tributes seems to be the best performer in 4 out of 6 datasets,
although again by a small and not always statistically sig-
nificant margin, and it is clearly outperformed only by fb in
the other 2 datasets. This is particularly interesting, since
many nb-based spam filters appear to adopt the multino-
mial nb with tf attributes or the multi-variate Bernoulli nb
(which uses Boolean attributes); the latter seems to be the
worst among the nb versions we evaluated. Among the nb
versions that we tested with normalized tf attributes (fb
and the multi-variate Gauss nb), overall fb is clearly the
best. However, fb does not always outperform the other
nb version that uses non-Boolean attributes, namely the
multinomial nb with tf attributes.
The fb classifier shows signs of impressive superiority in
Enron1 and Enron2, and its performance is almost indistinguishable
from that of the top performers in Enron5 and
Enron6. However, it does not perform equally well, com-
pared to the top performers, in the other two datasets (En-
ron3, Enron4), which strangely include what appears to be
the easiest dataset (Enron4). One problem we noticed with
fb is that its estimates for $p(c \mid \vec{x})$ are very close to 0 or 1;
hence, varying the threshold $T$ has no effect on the classification
of many messages. This did not allow us to obtain
higher ham recall (lower $1 - \text{specificity}$) by trading off spam
recall (sensitivity) as well as in the other nb versions, which
is why the fb roc curves are shorter in some of the diagrams.
(The same comment applies to the multi-variate Gauss nb.)
Having said that, we were able to reach a ham recall level
of 99.9% or higher with fb in most of the datasets.
Overall, the multinomial nb with Boolean attributes and
fb obtained the best results in our experiments, but the differences from the other nb versions were often very small.
[Figure 2: ROC curves of the five NB versions with 3000 attributes. Panels shown include Enron1, Enron3, Enron5, and Enron6. Horizontal axis: 1 - specificity (1 - ham recall), restricted to [0, 0.2]; vertical axis: sensitivity (spam recall). Curves: Flexible Bayes; multi-variate Gauss NB; multinomial NB, TF; multi-variate Bernoulli NB; multinomial NB, Boolean.]
Table 4: Spam recall (%) for 3000 attributes, $T = 0.5$.

nb version   Enr1   Enr2   Enr3   Enr4   Enr5   Enr6   Avg.
fb           90.50  93.63  96.94  95.78  99.56  99.55  95.99
mv Gauss     93.08  95.80  97.55  80.14  95.42  91.95  92.32
mn tf        95.66  96.81  95.04  97.79  99.42  98.08  97.13
mv Bern.     97.08  91.05  97.42  97.70  97.95  97.92  96.52
mn Bool.     96.00  96.68  96.94  97.79  99.69  98.10  97.53

Table 5: Ham recall (%) for 3000 attributes, $T = 0.5$.

nb version   Enr1   Enr2   Enr3   Enr4   Enr5   Enr6   Avg.
fb           97.64  98.83  95.36  96.61  90.76  89.97  94.86
mv Gauss     94.83  96.97  88.81  99.39  97.28  95.87  95.53
mn tf        94.00  96.78  98.83  98.30  95.65  95.12  96.45
mv Bern.     93.19  97.22  75.41  95.86  90.08  82.52  89.05
mn Bool.     95.25  97.83  98.88  99.05  95.65  96.88  97.26
Taking into account its smoother trade-off between ham and
spam recall, and its better computational complexity at run
time, we tend to prefer the multinomial nb with Boolean
attributes over fb, but further experiments are necessary to
establish its superiority with confidence. For completeness,
Tables 4 and 5 list the spam and ham recall, respectively, of
the nb versions on the 6 datasets for $T = 0.5$, although com-
paring at a fixed threshold Tis not particularly informative;
for example, two methods may obtain the same results at
different thresholds. On average, the multinomial nb with
Boolean attributes again has the best results, both in spam
and ham recall.
4.3 Learning curves
Figure 3 shows the learning curves (spam and ham recall
as more training messages are accumulated over time) of the
multinomial nb with Boolean attributes on the six datasets
for $T = 0.5$. It is interesting to observe that the curves
do not increase monotonically, unlike most text classifica-
tion experiments, presumably because of the unpredictable
fluctuation of the ham-spam ratio, the changing topics of
spam, and the adversarial nature of anti-spam filtering. In
the “easiest” dataset (Enron4) the classifier reaches almost
perfect performance, especially in terms of ham recall, after
a few hundred messages, and quickly returns to near-
perfect performance whenever a deviation occurs. As more
training messages are accumulated, the deviations from the
perfect performance almost disappear. In contrast, in more
difficult datasets (e.g., Enron1) the fluctuation of ham and
spam recall is continuous. The classifier seems to adapt
quickly to changes, though, avoiding prolonged plateaus of
low performance. Spam recall is particularly high and stable
in Enron5, but this comes at the expense of frequent large
fluctuations of ham recall; hence, the high spam recall may
be the effect of a tradeoff between spam and ham recall.
5. CONCLUSIONS AND FURTHER WORK
We discussed and evaluated experimentally in a spam fil-
tering context five different versions of the Naive Bayes (nb)
classifier. Our investigation included two versions of nb that
have not been used widely in the spam filtering literature,
namely Flexible Bayes (fb) and the multinomial nb with
Boolean attributes. We emulated the situation faced by a
new user of a personalized learning-based spam filter, adopt-
ing an incremental retraining and evaluation procedure. The
six datasets that we used, and which we make publicly avail-
able, were created by mixing freely available ham and spam
messages in different proportions. The mixing procedure
emulates the unpredictable fluctuation over time of the ham-
spam ratio in real mailboxes.
Our evaluation included plotting roc curves, which al-
lowed us to compare the different nb versions across the
entire tradeoff between true positives and true negatives.
The most interesting result of our evaluation was the very
good performance of the two nb versions that have been
used less in spam filtering, i.e., fb and the multinomial nb
with Boolean attributes; these two versions collectively ob-
tained the best results in our experiments. Taking also into
account its lower computational complexity at run time and
its smoother trade-off between ham and spam recall, we tend
to prefer the multinomial nb with Boolean attributes over
fb, but further experiments are needed before we can be confident. The
best results in terms of effectiveness were generally achieved
with the largest attribute set (3000 attributes), as one might
have expected, but the gain was rather insignificant, com-
pared to smaller and computationally cheaper attribute sets.
We are currently collecting more data, in a setting that
will allow us to evaluate the five nb versions and other learn-
ing algorithms on several real mailboxes with the incremen-
tal retraining and evaluation method. The obvious caveat of
these additional real-user experiments is that it will not be
possible to provide publicly the resulting datasets in a non-
encoded form. Therefore, we plan to release them using the
encoding scheme of the pu datasets.
6. REFERENCES
[1] I. Androutsopoulos, J. Koutsias, K. Chandrinos, and
C. Spyropoulos. An experimental comparison of Naive
Bayesian and keyword-based anti-spam filtering with
encrypted personal e-mail messages. In 23rd ACM
SIGIR Conference, pages 160–167, Athens, Greece,
2000.
[2] I. Androutsopoulos, G. Paliouras, and E. Michelakis.
Learning to filter unsolicited commercial e-mail.
Technical Report 2004/2, NCSR “Demokritos”, 2004.
[3] R. Bekkerman, A. McCallum, and G. Huang.
Automatic categorization of email into folders:
benchmark experiments on Enron and SRI corpora.
Technical report IR-418, University of Massachusetts
Amherst, 2004.
[4] X. Carreras and L. Marquez. Boosting trees for
anti-spam email filtering. In 4th International
Conference on Recent Advances in Natural Language
Processing, pages 58–64, Tzigov Chark, Bulgaria,
2001.
[5] P. Domingos and M. Pazzani. On the optimality of the
simple Bayesian classifier under zero-one loss. Machine
Learning, 29(2–3):103–130, 1997.
[6] H. Drucker, D. Wu, and V. Vapnik. Support Vector
Machines for spam categorization. IEEE Transactions
On Neural Networks, 10(5):1048–1054, 1999.
[7] S. Eyheramendy, D. Lewis, and D. Madigan. On the
Naive Bayes model for text categorization. In 9th
International Workshop on Artificial Intelligence and
Statistics, pages 332–339, Key West, Florida, 2003.
[Figure 3: Learning curves (spam recall and ham recall vs. number of emails ×100) of the multinomial NB with Boolean attributes, 3000 attributes, T = 0.5, on Enron1–Enron6.]

[8] T. Fawcett. In “vivo” spam filtering: a challenge
problem for KDD. SIGKDD Explorations,
5(2):140–148, 2003.
[9] S. Hershkop and S. Stolfo. Combining email models
for false positive reduction. In 11th ACM SIGKDD
Conference, pages 98–107, Chicago, Illinois, 2005.
[10] J. G. Hidalgo. Evaluating cost-sensitive unsolicited
bulk email categorization. In 17th ACM Symposium
on Applied Computing, pages 615–620, 2002.
[11] J. G. Hidalgo and M. M. Lopez. Combining text and
heuristics for cost-sensitive spam filtering. In 4th
Computational Natural Language Learning Workshop,
pages 99–102, Lisbon, Portugal, 2000.
[12] J. Hovold. Naive Bayes spam filtering using
word-position-based attributes. In 2nd Conference on
Email and Anti-Spam, Stanford, CA, 2005.
[13] J. D. M. Rennie, L. Shih, J. Teevan, and D. Karger. Tackling
the poor assumptions of Naive Bayes classifiers. In
20th International Conference on Machine Learning,
pages 616–623, Washington, DC, 2003.
[14] G. John and P. Langley. Estimating continuous
distributions in Bayesian classifiers. In 11th
Conference on Uncertainty in Artificial Intelligence,
pages 338–345, Montreal, Quebec, 1995.
[15] B. Klimt and Y. Yang. The Enron corpus: a new
dataset for email classification research. In 15th
European Conference on Machine Learning and the
8th European Conference on Principles and Practice
of Knowledge Discovery in Databases, pages 217–226,
Pisa, Italy, 2004.
[16] A. Kolcz and J. Alspector. SVM-based filtering of
e-mail spam with content-specific misclassification
costs. In Workshop on Text Mining, IEEE
International Conference on Data Mining, San Jose,
California, 2001.
[17] A. McCallum and K. Nigam. A comparison of event
models for naive bayes text classification. In AAAI’98
Workshop on Learning for Text Categorization, pages
41–48, Madison, Wisconsin, 1998.
[18] E. Michelakis, I. Androutsopoulos, G. Paliouras,
G. Sakkis, and P. Stamatopoulos. Filtron: a
learning-based anti-spam filter. In 1st Conference on
Email and Anti-Spam, Mountain View, CA, 2004.
[19] P. Pantel and D. Lin. SpamCop: a spam classification
and organization program. In Learning for Text
Categorization – Papers from the AAAI Workshop,
pages 95–98, Madison, Wisconsin, 1998.
[20] F. Peng, D. Schuurmans, and S. Wang. Augmenting
naive bayes classifiers with statistical language
models. Information Retrieval, 7:317–345, 2004.
[21] M. Sahami, S. Dumais, D. Heckerman, and
E. Horvitz. A Bayesian approach to filtering junk
e-mail. In Learning for Text Categorization – Papers
from the AAAI Workshop, pages 55–62, Madison,
Wisconsin, 1998.
[22] G. Sakkis, I. Androutsopoulos, G. Paliouras,
V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos.
Stacking classifiers for anti-spam filtering of e-mail. In
Conference on Empirical Methods in Natural
Language Processing, pages 44–50, Carnegie Mellon
University, Pittsburgh, PA, 2001.
[23] G. Sakkis, I. Androutsopoulos, G. Paliouras,
V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos.
A memory-based approach to anti-spam filtering for
mailing lists. Information Retrieval, 6(1):49–73, 2003.
[24] K.-M. Schneider. A comparison of event models for
Naive Bayes anti-spam e-mail filtering. In 10th
Conference of the European Chapter of the ACL,
pages 307–314, Budapest, Hungary, 2003.
[25] K.-M. Schneider. On word frequency information and
negative evidence in Naive Bayes text classification. In
4th International Conference on Advances in Natural
Language Processing, pages 474–485, Alicante, Spain,
2004.
... This differs from the OP, where "noise level" is not clearly defined. 4 Experiment 4: Fixing mislabeled data -We take 1000 samples from the Enron1 spam dataset [7] and pre-process with ſCıĸıT-LEARN's TfIdfVectorizer to convert the emails to a matrix of token counts. 30% of the data is reserved for D test , the remaining is split 70/30 into D train and D val . ...
... Time constraints forced us to leave the 50K experiment with dog-vs-fish out, but we were observing similar trends before interrupting it.7 However, at a buget of 5K, we observe that TMCS outperforms MCLC, in contrast to Claim 2.8 The OP lacked details, and their code seemed to lack this particular experiment. ...
Article
Full-text available
We investigate the results of [1] in the field of data valuation. We repeat their experiments and conclude that the (Monte Carlo) Least Core is sensitive to important characteristics of the ML problem of interest, making it difficult to apply. Scope of Reproducibility-We test all experimental claims about Monte Carlo approximations to the Least Core and their application to standard data valuation tasks. Methodology-We use the open source library [2] for all valuation algorithms. We document all details on dataset choice and generation in this paper, and release all code as open source in [3]. Results-We were able to reproduce the results on Least Core approximation. For the task of low-value point identification we observed similar performance for least core and (Truncated Monte Carlo) Shapley values, whereas for high-value identification, the least core outperformed other methods. In two experiments, we must depart from the original paper and arrive at different conclusions. Overall, we find that the Least Core offers similar results to other game-theoretic approaches to data valuation, but it does not overcome the main drawbacks of computational complexity and sensitivity to ran-domness that such techniques have. What was easy-Open source libraries like DVC and ray enabled efficiently designing and running the experiments. What was difficult-Data generation was difficult for dog-vs-fish because no code was available. Computing the Monte Carlo Least Core was very sensitive to the choice of utility function. Reproducing some experiments was difficult due to lack of details. Communication with original authors-We asked the authors for details on the experimental setup and they kindly and promptly sent us the code used for the paper. This was very useful in understanding all steps taken and in uncovering some weaknesses in the experiments.
... It makes predictions by calculating the probability of the object. Due to their simplicity and high performance, these approaches are the most widely used in open-source systems proposed for spam filtering [38]. This algorithm is also used in areas such as article classification and sentiment analysis. ...
Article
Full-text available
Electronic Electronic messages, i.e. e-mails, are a communication tool frequently used by individuals or organizations. While e-mail is extremely practical to use, it is necessary to consider its vulnerabilities. Spam e-mails are unsolicited messages created to promote a product or service, often sent frequently. It is very important to classify incoming e-mails in order to protect against malware that can be transmitted via e-mail and to reduce possible unwanted consequences. Spam email classification is the process of identifying and distinguishing spam emails from legitimate emails. This classification can be done through various methods such as keyword filtering, machine learning algorithms and image recognition. The goal of spam email classification is to prevent unwanted and potentially harmful emails from reaching the user's inbox. In this study, Random Forest (RF), Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN) algorithms are used to classify spam emails and the results are compared. Algorithms with different approaches were used to determine the best solution for the problem. 5558 spam and non-spam e-mails were analyzed and the performance of the algorithms was reported in terms of accuracy, precision, sensitivity and F1-Score metrics. The most successful result was obtained with the RF algorithm with an accuracy of 98.83%. In this study, high success was achieved by classifying spam emails with machine learning algorithms. In addition, it has been proved by experimental studies that better results are obtained than similar studies in the literature.
... The proposed ensemble method incorporates four OML classifiers: BernoulliNB [14], Passive-Aggressive [15], SGD [16], and MLP [17]. This approach significantly enhances attack identification, adeptly manages concept drift, and strengthens SDN security. ...
Article
Full-text available
Software Defined Networks (SDN) offer dynamic reconfigurability and scalability, revolutionizing traditional networking. However, countering Distributed Denial of Service (DDoS) attacks remains a formidable challenge for both traditional and SDN-based networks. The integration of Machine Learning (ML) into SDN holds promise for addressing these threats. While recent research demonstrates ML’s accuracy in distinguishing legitimate from malicious traffic, it faces difficulties in handling emerging, low-rate, and zero-day DDoS attacks due to limited feature scope for training. The ever-evolving DDoS landscape, driven by new protocols, necessitates continuous ML model retraining. In response to these challenges, we propose an ensemble online machine-learning model designed to enhance DDoS detection and mitigation. This approach utilizes online learning to adapt the model with expected attack patterns. The model is trained and evaluated using SDN simulation (Mininet and Ryu). Its dynamic feature selection capability overcomes conventional limitations, resulting in improved accuracy across diverse DDoS attack types. Experimental results demonstrate a remarkable 99.2% detection rate, outperforming comparable models on our custom dataset as well as various benchmark datasets, including CICDDoS2019, InSDN, and slow-read-DDoS. Moreover, the proposed model undergoes comparison with industry-standard commercial solutions. This work establishes a strong foundation for proactive DDoS threat identification and mitigation in SDN environments, reinforcing network security against evolving cyber risks.
Article
Full-text available
Content filtering is most widely used on the internet to filter email and web based content. It is the technique used to screen and exclude from access of Web pages in internet or e-mail that is deemed objectionable. It is also known as information filtering. It works by specifying character strings that, if matched, indicate undesirable content that is to be screened out. For many, accessing the Internet is a mixed blessing; in worst case, it can create serious problems. Web Content Filtering is a firewall to block certain sites from being accessed. Content filtering and the products that offer this service can be divided into Web filtering, the screening of Web sites or pages, and e-mail filtering, the screening of e-mail for spam or other objectionable content. This paper provides a conclusive survey of major types, tasks, tools, process, an algorithm involved in the web content filtering and also suggested a new methodology to screened the text content in the Webpages and make a decision algorithm whether the webpage was allowed or banned from access.
Article
Graph anomaly detection aims to identify anomalous occurrences in networks. However, this is more challenging than the traditional anomaly detection problem because anomalies in graphs can manifest in three different forms: anomalous nodes, anomalous edges, and anomalous sub-graphs. It is crucial to detect all these anomaly types within a single framework to provide a unified solution to the graph anomaly detection task. The main objective of this study is to propose a model that is capable of detecting all static graph anomalies in a single architecture across various domains. In this paper, we introduce DeGAN (Decomposition-based unified Graph ANomaly detection), a novel framework for unified graph anomaly detection in static networks. DeGAN combines two deep learning concepts with graph decomposition to identify anomalous graph objects: a graph neural network and an adversarial autoencoder. A distinguishing feature of DeGAN is its ability to detect anomalies in a single process, and adopting graph decomposition improves performance compared to the traditional adversarial learning approach. DeGAN is evaluated on six real-world datasets to demonstrate that the framework can work in multiple domains. Experimental results demonstrate that DeGAN is capable of detecting anomalous nodes, edges, and sub-graphs within a single model. Additionally, the effectiveness of the sub-components of DeGAN has been demonstrated through experimentation. Even though DeGAN is proposed for plain graphs, it can be extended to attributed and dynamic graphs.
Keywords: Graph anomaly detection, Graph decomposition, Adversarial autoencoder, Graph neural networks
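The autoencoder component of such frameworks scores objects by reconstruction error. The sketch below is a much-simplified stand-in: it uses a plain MLP regressor rather than DeGAN's adversarial autoencoder or GNN, and the toy graph is an illustrative assumption, not the paper's data.

```python
# Minimal sketch: score nodes by how poorly an autoencoder reconstructs
# their adjacency rows; high reconstruction error suggests an anomaly.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
A = (rng.random((60, 60)) < 0.05).astype(float)  # toy adjacency matrix
A = np.maximum(A, A.T)                           # make it undirected
A[0, 1:20] = A[1:20, 0] = 1.0                    # node 0: anomalous hub

ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
ae.fit(A, A)                                     # learn to reconstruct rows
errors = ((ae.predict(A) - A) ** 2).sum(axis=1)  # per-node reconstruction error
print("most anomalous node:", int(errors.argmax()))
```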
Article
In this paper, we hypothesized that a leading politician’s behavior signaling support for the ideas of hate groups will increase hate speech by shifting social norms. To examine the hypotheses, in particular, to quantify the social norms shift, the study utilized an adoption threshold measure based on Twitter retweet networks. Our empirical study focused on the hate speech spread effect of an announcement by Yuriko Koike, Governor of Tokyo, declining to participate in a memorial ceremony for Korean massacre victims. The results support the hypothesis: after Koike’s announcement, adoption thresholds of hate speech users were significantly lowered, while the number of hate speech users increased. Further, this study compared the effect of Koike’s announcement to the effect of a North Korean missile launch, a national security threat to Japan. The average hate speech adoption threshold was lower after Koike’s announcement than after the missile launch, suggesting that a leading politician’s behavior could have a greater impact on shifting norms of prejudice than even a nationally threatening event.
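The adoption-threshold measure used in this study has a simple operational form: a user's threshold is the fraction of their network neighbours who had already adopted at the user's own adoption time. The sketch below computes it on a toy network; the adjacency structure and adoption times are illustrative assumptions.

```python
# Minimal sketch: adoption thresholds on a toy retweet-style network.
neighbours = {"a": {"b", "c", "d"}, "b": {"a", "c"},
              "c": {"a", "b", "d"}, "d": {"a", "c"}}
adoption_time = {"a": 1, "b": 3, "c": 2, "d": None}  # None = never adopted

def adoption_threshold(user):
    if adoption_time[user] is None:
        return None
    nbrs = neighbours[user]
    already = [n for n in nbrs
               if adoption_time[n] is not None
               and adoption_time[n] < adoption_time[user]]
    return len(already) / len(nbrs) if nbrs else 0.0

for u in "abc":
    print(u, adoption_threshold(u))  # lower thresholds = easier adoption
```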
Conference Paper
Spam filtering is a text categorization task that exhibits special features which make it interesting and difficult. First, the task has traditionally been performed using heuristics from the domain. Second, a cost model is required to avoid misclassification of legitimate messages. We present a comparative evaluation of several machine learning algorithms applied to spam filtering, considering the text of the messages and a set of heuristics for the task. Cost-oriented biasing and evaluation is performed.
Article
This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, also known as "spam", floods the mailboxes of users, causing frustration, wasting bandwidth and money, and exposing minors to unsuitable content. Using a recently introduced publicly available corpus, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed, including different attribute and distance weighting schemes, and studies on the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs on average better, particularly when the misclassification cost for non-spam messages is high.
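Cost-sensitive evaluation of the kind this abstract describes is commonly expressed as a weighted accuracy, where misclassifying a legitimate message counts λ times more than misclassifying spam. The sketch below shows one such function; the exact evaluation functions of the paper may differ in detail.

```python
# Minimal sketch: weighted accuracy with asymmetric misclassification costs.
def weighted_accuracy(y_true, y_pred, lam):
    """y_true/y_pred: 1 = spam, 0 = legitimate; lam = relative cost of
    misclassifying legitimate mail."""
    ham_ok = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    spam_ok = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    n_ham = sum(1 for t in y_true if t == 0)
    n_spam = sum(1 for t in y_true if t == 1)
    return (lam * ham_ok + spam_ok) / (lam * n_ham + n_spam)

print(weighted_accuracy([0, 0, 1, 1], [0, 1, 1, 1], lam=9))  # one costly ham error
```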
Conference Paper
In recent years, Unsolicited Bulk Email (UBE) has become an increasingly important problem, with a big economic impact. In this paper, we discuss cost-sensitive Text Categorization methods for UBE filtering. Specifically, we have evaluated a range of Machine Learning methods for the task (C4.5, Naive Bayes, PART, Support Vector Machines and Rocchio), made cost-sensitive through several methods (Threshold Optimization, Instance Weighting, and MetaCost). We have used the Receiver Operating Characteristic Convex Hull method for the evaluation, which best suits classification problems in which target conditions are not known, as is the case here. Our results do not show a dominant algorithm, nor a dominant method for making algorithms cost-sensitive, but they are the best reported on the test collection used and approach the accuracy of real-world hand-crafted classifiers.
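Threshold optimization, one of the cost-sensitivity methods evaluated above, replaces the default 0.5 decision cut-off with the probability threshold that minimizes expected cost on held-out data. The sketch below shows the idea; the cost ratio and toy data are illustrative assumptions.

```python
# Minimal sketch: pick the decision threshold that minimizes expected cost.
import numpy as np

def best_threshold(probs, y_true, cost_fp=9.0, cost_fn=1.0):
    """probs: P(spam); fp = ham classified as spam, fn = missed spam."""
    thresholds = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in thresholds:
        pred = (probs >= t).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        costs.append(cost_fp * fp + cost_fn * fn)
    return thresholds[int(np.argmin(costs))]

probs = np.array([0.1, 0.4, 0.6, 0.9, 0.95])
y = np.array([0, 0, 0, 1, 1])
print(best_threshold(probs, y))  # raises the cut-off to protect ham
```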
Article
We address the problem of separating legitimate emails from unsolicited ones in the context of a large-scale operation, where the diversity of user accounts is very high, while misclassification costs are content-dependent and highly asymmetric. A category-specific cost model is proposed and several effective methods of training a cost-sensitive filter are studied, using a Support Vector Machine (SVM) as the base classifier. Clear benefits of explicitly accounting for varied misclassification costs, either during training or as a form of post-processing, are shown.
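One standard way to encode asymmetric misclassification costs during SVM training is through per-class weights. The sketch below illustrates this with scikit-learn; the 10:1 cost ratio, TF-IDF features, and toy data are assumptions, not the paper's configuration.

```python
# Minimal sketch: cost-sensitive SVM via class weights.
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["cheap meds online now", "meeting moved to 3pm",
         "win a free prize today", "quarterly report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

X = TfidfVectorizer().fit_transform(texts)
# errors on class 0 (legitimate mail) cost 10x more than errors on spam
svm = LinearSVC(class_weight={0: 10.0, 1: 1.0})
svm.fit(X, labels)
print(svm.predict(X))
```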
Article
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier which allows for a local Markov dependence among observations; a model we refer to as the Chain Augmented Naive Bayes (CAN) classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes—allowing a local Markov chain dependence in the observed variables—while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which allows one to obtain better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language-independent and task-independent nature of these classifiers, we present experimental results on several text classification problems—authorship attribution, text genre classification, and topic detection—in several languages—Greek, English, Japanese and Chinese. We then systematically study the key factors in the CAN model that can influence the classification performance, and analyze the strengths and weaknesses of the model.
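The chain-augmented idea can be illustrated by modelling each class with a bigram language model and assigning a document to the class under which it is most probable. The sketch below uses plain add-one smoothing; real CAN models use more sophisticated smoothing, and the training data here is an illustrative assumption.

```python
# Minimal sketch: per-class bigram language models with Laplace smoothing.
import math
from collections import Counter

def train_bigram(docs):
    uni, bi = Counter(), Counter()
    for doc in docs:
        toks = ["<s>"] + doc.lower().split()
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def log_prob(doc, uni, bi, V):
    toks = ["<s>"] + doc.lower().split()
    return sum(math.log((bi[(a, b)] + 1) / (uni[a] + V))
               for a, b in zip(toks, toks[1:]))

spam_docs = ["buy cheap pills", "cheap pills online", "buy now cheap"]
ham_docs = ["meeting agenda attached", "see agenda for the meeting"]
models = {"spam": train_bigram(spam_docs), "ham": train_bigram(ham_docs)}
V = len({w for d in spam_docs + ham_docs for w in d.lower().split()} | {"<s>"})

doc = "cheap pills now"
print(max(models, key=lambda c: log_prob(doc, *models[c], V)))  # -> spam
```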
Article
Office workers everywhere are drowning in email—not only spam, but also large quantities of legitimate email to be read and organized for browsing. Although there have been extensive investigations of automatic document categorization, email gives rise to a number of unique challenges, and there has been relatively little study of classifying email into folders. This paper presents an extensive benchmark study of email foldering using two large corpora of real-world email messages and foldering schemes: one from former Enron employees, another from participants in an SRI research project. We discuss the challenges that arise from differences between email foldering and traditional document classification. We show experimental results from an array of automated classification methods and evaluation methodologies, including a new evaluation method of foldering results based on the email timeline, and including enhancements to the exponential gradient method Winnow, providing top-tier accuracy with a fraction of the training time of alternative methods. We also establish that classification accuracy in many cases is relatively low, confirming the challenges of email data, and pointing toward email foldering as an important area for further research.
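The Winnow algorithm this benchmark enhances is a mistake-driven linear classifier with multiplicative weight updates. The sketch below shows the textbook version, not the paper's exponential-gradient variant; the toy data is an illustrative assumption.

```python
# Minimal sketch: textbook Winnow on binary features, promotion/demotion
# of weights on mistakes.
def winnow_train(examples, n_features, alpha=2.0):
    w = [1.0] * n_features
    theta = n_features / 2.0
    for x, y in examples:  # x: binary feature list, y: 0/1 label
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0
        if pred != y:
            for i, xi in enumerate(x):
                if xi:
                    w[i] = w[i] * alpha if y == 1 else w[i] / alpha
    return w, theta

# toy data: feature 0 alone determines the class
data = [([1, 0, 1], 1), ([0, 1, 1], 0), ([1, 1, 0], 1), ([0, 0, 1], 0)] * 5
w, theta = winnow_train(data, 3)
print(w, theta)  # weight on feature 0 is promoted, others demoted
```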
Article
This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed using a publicly available corpus. The investigation includes different attribute and distance-weighting schemes, and studies on the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs on average better, particularly when the misclassification cost for non-spam messages is high.
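Memory-based learning of the kind this abstract evaluates is essentially k-nearest-neighbour classification over stored training messages. The sketch below shows a distance-weighted k-NN filter with scikit-learn; the attribute-weighting schemes and cost-sensitive evaluation of the paper are omitted, and the data is an illustrative assumption.

```python
# Minimal sketch: distance-weighted k-NN spam filter over Boolean vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

texts = ["free offer click now", "lunch at noon?", "free free prize",
         "project status update", "claim your free offer"]
labels = [1, 0, 1, 0, 1]  # 1 = spam, 0 = legitimate

vec = CountVectorizer(binary=True)
X = vec.fit_transform(texts)
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X, labels)
print(knn.predict(vec.transform(["free offer inside"])))  # -> [1]
```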
Article
We present a thorough investigation on using machine learning to construct effective personalized anti-spam filters. The investigation includes four learning algorithms, Naive Bayes, Flexible Bayes, LogitBoost, and Support Vector Machines, and four datasets, constructed from the mailboxes of different users. We discuss the model and search biases of the learning algorithms, along with worst-case computational complexity figures, and observe how the latter relate to experimental measurements. We study how classification accuracy is affected when using attributes that represent sequences of tokens, as opposed to single tokens, and explore the effect of the size of the attribute and training set, all within a cost-sensitive framework. Furthermore, we describe the architecture of a fully implemented learning-based anti-spam filter, and present an analysis of its behavior in real use over a period of seven months. Information is also provided on other available learning-based anti-spam filters, and alternative filtering approaches.
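Using attributes that represent sequences of tokens, as investigated above, corresponds to adding word n-grams alongside single tokens. The sketch below shows this with scikit-learn's `ngram_range`; the classifier choice and data are illustrative assumptions.

```python
# Minimal sketch: unigram + bigram attributes for a Naive Bayes filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["limited time offer", "offer accepted for the project",
         "limited time only act now", "project review at noon"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

vec = CountVectorizer(ngram_range=(1, 2))  # single tokens and token bigrams
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)
print(vec.get_feature_names_out()[:6])     # attributes now include bigrams
print(clf.predict(vec.transform(["limited time deal"])))
```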
Conference Paper
The Naive Bayes classifier exists in different versions. One version, called multi-variate Bernoulli or binary independence model, uses binary word occurrence vectors, while the multinomial model uses word frequency counts. Many publications cite this difference as the main reason for the superior performance of the multinomial Naive Bayes classifier. We argue that this is not true. We show that when all word frequency information is eliminated from the document vectors, the multinomial Naive Bayes model performs even better. Moreover, we argue that the main reason for the difference in performance is the way that negative evidence, i.e. evidence from words that do not occur in a document, is incorporated in the model. Therefore, this paper aims at a better understanding and a clarification of the difference between the two probabilistic models of Naive Bayes.
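The comparison this abstract discusses, multinomial Naive Bayes applied to Boolean document vectors versus the multi-variate Bernoulli model, is easy to set up empirically. The sketch below uses scikit-learn on toy data, which is an illustrative assumption; on real corpora the paper reports that the binarized multinomial model performs at least as well as the frequency-based one.

```python
# Minimal sketch: Bernoulli NB vs. multinomial NB on the same Boolean
# (binarized) occurrence vectors; Bernoulli NB also uses negative evidence
# from words absent in the document.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

texts = ["free money now", "meeting notes attached", "free prize money",
         "notes from the meeting", "win money free"]
labels = [1, 0, 1, 0, 1]  # 1 = spam, 0 = legitimate

vec = CountVectorizer(binary=True)  # Boolean word occurrence vectors
X = vec.fit_transform(texts)
for clf in (BernoulliNB(), MultinomialNB()):
    clf.fit(X, labels)
    print(type(clf).__name__,
          clf.predict(vec.transform(["free meeting money"])))
```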
Conference Paper
Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
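A baseline of the kind this abstract describes can be sketched by feeding message sections to an SVM. The outline below simply concatenates all sections into one text per message; the message parsing, folder labels, and toy data are illustrative assumptions, and the paper's per-section experiments and regression weighting are omitted.

```python
# Minimal sketch: folder prediction with an SVM over combined sections.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# each message as (From, To, Subject, body); labels are folder names
msgs = [("alice@enron.com", "bob@enron.com", "gas schedule", "see schedule"),
        ("carol@enron.com", "bob@enron.com", "team lunch", "friday at noon"),
        ("alice@enron.com", "bob@enron.com", "gas nominations", "updated nums"),
        ("dave@enron.com", "bob@enron.com", "lunch menu", "thai or pizza?")]
folders = ["trading", "personal", "trading", "personal"]

texts = [" ".join(m) for m in msgs]  # all sections in combination
vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LinearSVC().fit(X, folders)
print(clf.predict(vec.transform(["alice@enron.com bob@enron.com gas update"])))
```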