On the Validity of a New SMS Spam Collection
José María Gómez Hidalgo
R&D Department
Optenet
Las Rozas, Madrid, Spain
jgomez@optenet.com
Tiago A. Almeida
Department of Computer Science
Federal University of São Carlos UFSCar
Sorocaba, São Paulo, Brazil
talmeida@ufscar.br
Akebo Yamakami
School of Electrical and Computer Engineering
University of Campinas UNICAMP
Campinas, São Paulo, Brazil
akebo@dt.fee.unicamp.br
Abstract—Mobile phones are becoming the latest target of electronic junk mail. Recent reports clearly indicate that the volume of SMS spam messages is increasing dramatically year by year. One of the major concerns in academic settings has been the scarcity of public SMS spam datasets, which are sorely needed for the validation and comparison of different classifiers. To address this issue, we have recently proposed a new SMS Spam Collection that is, to the best of our knowledge, the largest public and real SMS dataset available for academic studies. However, as it was created by augmenting a previously existing database built from roughly the same sources, it is sensible to verify that no duplicates were introduced in the process. In this paper we therefore offer a comprehensive analysis of the new SMS Spam Collection to ensure that this is not the case, since duplicates may ease the task of learning SMS spam classifiers and hence compromise the evaluation of methods. The analysis indicates that the procedure followed does not lead to near-duplicates and, consequently, that the proposed dataset is reliable for evaluating and comparing the performance achieved by different classifiers.
Keywords-Spam filtering; Mobile spam; Text categorization;
Classification; Text analysis.
I. INTRODUCTION
Text messaging is a communication service component of
phone, web or mobile communication systems, using stan-
dardized communications protocols that allow the exchange
of short text messages between fixed line or mobile phone
devices. While the original term was derived from referring
to messages sent using the Short Message Service (SMS),
it has since been extended to include messages containing
image, video, and audio.
Mobile text messages are commonly used between mobile
phone users, as a substitute for voice calls in situations
where voice communication is impossible or undesirable.
Such way of communication is also very popular because
in some places text messages are significantly cheaper than
placing a phone call to another mobile phone.
Messaging still dominates mobile market non-voice revenues worldwide. According to a report recently provided by Portio Research¹, the worldwide mobile messaging market was worth USD 179.2 billion in 2010, passed USD 200 billion in 2011, and will probably reach USD 300 billion in 2014. The same study indicates that annual worldwide SMS traffic volumes rose to over 6.9 trillion by end-2010 and were expected to break 8 trillion by end-2011.

¹ http://www.portioresearch.com/MMF11-15.html
Mobile messages can be used to interact with automated
systems such as ordering products and services for mobile
phones or participating in contests. Service providers and
advertisers use direct text marketing to notify mobile phone
users about promotions, payment due dates and other noti-
fications that can usually be sent by post or e-mail.
The downside is that cell phones are becoming the latest
target of electronic junk mail, with a growing number of
marketers using text messages to target subscribers. SMS
spam (sometimes also called mobile phone spam) is any
unwanted or unsolicited text message received on a mobile
device. Although this practice is rare in North America, it
has been very common in some parts of Asia.
SMS text messaging offers a target rich environment for
spammers. With the explosive growth in text messaging
along with unlimited texting plans it barely costs anything
for the attackers to send malicious messages. This combined
with the trust users inherently have in their mobile devices
makes it an environment rife for attack. In fact, recent research by the Cloudmark company² reveals that financial fraud and spam via text messages are now growing at a rate of over 300 percent year over year.
In the same way that carriers face a real challenge in dealing with SMS spam, academic researchers in this field are also experiencing difficulties. Probably the major concern is the lack of large, real and public databases. Unlike the large number of available email spam datasets [1], [2], [3], [4], [5], [6], [7], there are very few corpora with real examples of mobile phone spam and, to make matters worse, they are usually small.
To fill this important gap, we have recently proposed the new SMS Spam Collection [8], which is, as far as we know, the largest real, public and non-encoded SMS spam corpus. However, it has been created by augmenting a previously existing database built from roughly the same sources. Thus, it is very important to verify that no duplicates were carried over from the other databases, since the added messages may contain messages already present in the original collection. In this paper, we therefore perform a detailed analysis of the new SMS Spam Collection in order to ensure that this does not happen, as it may ease the task of learning SMS spam classifiers.
² http://blog.cloudmark.com/2011/12/05/surge-in-financial-related-mobile-spam-in-q4
This paper is organized as follows: Section II presents the new SMS Spam Collection. A comprehensive near-duplicate analysis of the new SMS Spam Collection and its main results are presented in Section III. Finally, in Section IV, we offer conclusions and outline future work.
II. THE SMS SPAM COLLECTION
Reliable data are essential in any scientific research. It is common sense that the absence of representative data can seriously impact the evaluation and comparison of methods and, unfortunately, newer areas of study are generally affected by the lack of publicly available data.
To address the lack of SMS spam datasets, in [8] we proposed a new real, public and non-encoded SMS Spam Collection³ that is, as far as we know, the largest one available. Moreover, we offered a comprehensive performance evaluation comparing several established machine learning methods in order to provide good baseline results for further comparison.
As pointed out in [8], to create the SMS Spam Collection we collected data from different sources. First, a set of 425 SMS spam messages was manually extracted from the Grumbletext Web site⁴. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages.
We also added legitimate samples by inserting 450 SMS messages collected from Caroline Tagg's PhD thesis⁵. Furthermore, we selected 3,375 SMS ham messages randomly chosen from the NUS SMS Corpus⁶.
Finally, we incorporated the SMS Spam Corpus v.0.1 Big⁷, which is composed of 1,002 SMS ham messages and 322 spam messages. More details about this dataset can be found in [9], [10], and [11]. However, it is important to point out that the sources used for building this corpus are almost the same as those used to create the new SMS Spam Collection.
Despite the importance of the new collection in a scenario that badly needs such data, the SMS Spam Collection was created with messages from a previously existing database built using roughly the same sources. Therefore, at this stage, it is very important to perform a careful analysis of the validity of the proposed dataset, checking whether there are duplicates coming from both databases.
III. DUPLICATE ANALYSIS OF THE SMS SPAM COLLECTION
To ensure that the way the SMS Spam Collection has been built, by reusing the same message sources, does not lead to
³ The SMS Spam Collection is available at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection
⁴ The Grumbletext Web site is available at http://www.grumbletext.co.uk/
⁵ Caroline Tagg's PhD thesis is available at http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf
⁶ The NUS SMS Corpus is available at http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/
⁷ The SMS Spam Corpus v.0.1 Big is publicly available at http://www.esp.uem.es/jmgomez/smsspamcorpus/
invalid SMS spam filtering results, we need to study the potential overlap between the sub-collections used to build it. The hypothesis is that the messages added to the original SMS collection, even though extracted from the same sources (the Grumbletext site, the NUS SMS Corpus), do not duplicate previously existing messages, except for duplicates already present in the original collection or in the message sources themselves. Under this hypothesis, if there are duplicates in the final collection, the only causes can be:
- Spammers do use templates when writing their spam messages.
- Legitimate users do make use of message templates existing in their mobile phones.
- Legitimate users do re-send chain letters (e.g. jokes, Christmas messages, etc.).
So, if the task of SMS spam filtering is eased because of
these duplicate messages, the reason for this is the actual
behavior of SMS messaging by spammers and legitimate
users, and not the way the collection used for testing was
built.
In consequence, we have built the three SMS sub-collections described below (original, added, and all messages), and we have studied the most frequent duplicates in each of them. The hypothesis is confirmed if:
1) The existing duplicates in the original sub-collection
keep the same frequency statistics in the final collec-
tion, and
2) the existing duplicates in the added messages keep the
same frequency statistics in the final collection as well.
In the next sections, we describe the three sub-collections
used in the study, along with the approach we have used to
detect message duplicates, or more properly, near-duplicates.
We detail the results of the analysis, which confirm our
hypothesis.
A. Text collections
In order to evaluate the potential overlap between the
datasets which were used to build the proposed SMS Spam
Collection, we have searched for near-duplicates within three
sub-collections:
- The previously existing SMS Spam Corpus v.0.1 Big (INIT).
- The SMS collection that includes the additional messages from Grumbletext, the NUS SMS Corpus, and Tagg's PhD thesis (ADD).
- The released SMS Spam Collection (FINAL).
The INIT dataset has a total of 1,324 text messages, of which 1,002 are ham and 322 are spam. The ADD sub-collection is composed of 3,825 legitimate messages and 425 mobile spam messages, for a total of 4,250 text messages. The percentages of ham and spam are shown in Table I.
It is worth noticing that the previously existing SMS Spam Corpus v.0.1 Big, which corresponds to the INIT sub-collection, poses a simpler problem to machine-learning content-based spam filters, as that collection is more balanced than the new SMS Spam Collection. On the other hand, the new collection is much bigger, and more data often implies better learning generalization.
Table I: How the sub-collections are composed.

Class         INIT               ADD
            Amount     Pct     Amount     Pct
Ham          1,002   75.68      3,825   90.00
Spam           322   24.32        425   10.00
Total        1,324  100.00      4,250  100.00
In Table II we present the main statistics related to the
tokens extracted from the INIT and ADD sub-collections.
Table II: Basic statistics related to the tokens extracted from the sub-collections.

                 INIT      ADD
Ham            12,192   51,419
Spam            7,682    9,861
Total          19,874   61,280
Avg per Msg     15.01    14.42
Avg in Ham      12.17    13.44
Avg in Spam     23.86    23.20
Note that, in both sub-collections, mobile phone spam messages are on average about ten tokens longer than legitimate messages. Also note that the average number of tokens per message is quite similar in both sub-collections.
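Statistics like those in Table II can be reproduced with a few lines of code. The sketch below is illustrative only: the input is a hypothetical list of (label, text) pairs, and whitespace tokenization is an assumption, since the paper does not specify its tokenizer.

```python
def token_stats(messages):
    # messages: list of (label, text) pairs; labels assumed to be "ham"/"spam"
    totals = {"ham": 0, "spam": 0}   # token counts per class
    counts = {"ham": 0, "spam": 0}   # message counts per class
    for label, text in messages:
        totals[label] += len(text.split())  # naive whitespace tokenization (assumption)
        counts[label] += 1
    return {c: totals[c] / counts[c] for c in totals if counts[c]}

sample = [("ham", "sorry i ll call later"),
          ("spam", "you have won a guaranteed prize call now")]
print(token_stats(sample))  # -> {'ham': 5.0, 'spam': 8.0}
```

Run over the full sub-collections, the same computation would yield the "Avg in Ham" and "Avg in Spam" rows of Table II.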
B. Near-duplicate detection approach
For the particular needs of this study, and given the short nature of SMS messages, the "String-of-Text" method can be considered a reasonable baseline for detecting near-duplicate messages in our collection. The "String-of-Text" method, implemented by the WCopyfind⁸ tool, involves scanning suspect texts for approximately matching character sequences. In order not to be fooled by small manual modifications, this approximate matching can involve transformations like case changing and separator variation (e.g. addressing those users who include extra white spaces between words).
The “String-of-Text” method is a simplified version of the
general N-gram matching detection method, widely used in
the literature [12], [13]. An N-gram is an ordered sequence
of tokens or words present in a text, in which N is the
number of tokens.
For this purpose, texts are compared searching for N-grams of relatively large sizes (e.g. N = 6), with additional parameters (length of match in number of characters, etc.). This approach is implemented in WCopyfind, but we have simplified it to N-gram matching after a text normalization involving:
- Replacing all token separators by white spaces.
- Lowercasing all characters.
- Replacing digits by the character 'N' (to preserve the structure of phone numbers).
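As a rough illustration, the three normalization steps above can be sketched as follows (a minimal sketch in Python; the exact set of token separators used by the authors is not specified, so the regular expression here is an assumption):

```python
import re

def normalize(text):
    text = re.sub(r"[^\w£]+", " ", text)  # replace token separators by white spaces
    text = text.lower()                   # lowercase all characters
    text = re.sub(r"\d", "N", text)       # replace digits by 'N' (keeps number structure)
    return text.strip()

print(normalize("Reply STOP to 84122 Customer Services 08450542832"))
# -> "reply stop to NNNNN customer services NNNNNNNNNNN"
```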
For instance, the 6-gram "stop to NNNNN customer services NNNNNNNNNNN" corresponds to a match between the following two messages within the ADD sub-collection:

    Thank you, winner notified by sms. Good Luck! No future marketing reply STOP to 84122 customer services 08450542832

and

    Your unique user ID is 1172. For removal send STOP to 87239 customer services 08708034412

⁸ See: http://plagiarism.phys.virginia.edu
As can be seen, the two messages are not near-duplicates; instead, they share a common pattern, the matching 6-gram, found in messages reported by users as SMS spam on the Grumbletext site. In particular, the messages correspond to two different SMS advertising campaigns to which the users had actually not subscribed.
In consequence, this near-duplicate approach, especially with relatively short N-grams, can lead to many false positives. As a result, the statistics collected during our analysis represent an upper bound on the potential near-duplicates that occur in the final collection. In our opinion, this is safer than finding a lower bound, because in this way no near-duplicates will be missed, and the conclusions of the study are sound.
In order to find matching N-grams and message near-duplicates within a given sub-collection, we have followed this procedure:
1) All messages within the sub-collection are taken as a
sorted list.
2) Each N-gram for a message is built from left to right.
3) A match or hit is registered when an N-gram present
in a message 𝑖 is found in a message 𝑗, with 𝑖<𝑗.
4) If a hit for messages 𝑖 and 𝑗 is registered, no other
matches between those messages are stored.
5) All N-grams occurring in two or more messages are
stored, along with the number of messages in which
they occur.
Thus, if a particular N-gram is present in messages 𝑖, 𝑗
and 𝑘 with 𝑖<𝑗<𝑘, only the hits for 𝑖 and 𝑗, and for 𝑗 and
𝑘 are counted. It must be noted that it is possible that there
is a match between messages 𝑖 and 𝑗, and another match
between 𝑗 and 𝑘, but not between 𝑖 and 𝑘 because both
previous matching N-grams are different (although they may
have some overlap). In consequence, the way we compare
SMS messages is not symmetric.
It is worth noting that it may be the case that two messages
have several N-grams in common. In fact, that would be the
case for full long duplicate messages. In this situation, only
the first left N-gram is reported, and then other co-occurring
N-grams may be missing counts for yet other messages.
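The matching procedure described above can be sketched in code (a simplified reading of steps 1 to 5, not the authors' actual implementation). Note how the `break` enforces that only the left-most shared N-gram is counted for each message pair, which is what makes the comparison non-symmetric:

```python
from collections import defaultdict

def ngrams(tokens, n):
    # all contiguous N-grams of a token list, left to right
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]

def find_hits(messages, n):
    # messages: sorted list of already-normalized, tokenized texts
    counts = defaultdict(int)  # N-gram -> number of message pairs it matched
    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            grams_j = set(ngrams(messages[j], n))
            for g in ngrams(messages[i], n):  # scan message i left to right
                if g in grams_j:
                    counts[g] += 1  # register one hit for the pair (i, j) ...
                    break           # ... and ignore further matches for this pair
    return dict(counts)

msgs = [m.split() for m in ["a b c d e f", "a b c d e g", "x b c d e f"]]
print(find_hits(msgs, 5))
# pair (0,1) hits on the left-most shared 5-gram ('a','b','c','d','e');
# pair (0,2) hits on ('b','c','d','e','f'); pair (1,2) shares no 5-gram
```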
C. Results and analysis
The goal of this process is to check whether merging the first two sub-collections adds many near-duplicates to the final database, in order to assess the overlap between both collections. Within each sub-collection, we have compared each pair of messages, stored all matches of size N (N-grams with N = 5, 6, and 10), and sorted the N-grams according to their frequency, examining in detail the top ten per N. According to the literature, N = 6 is a typical value for detecting near-duplicate paragraphs. We also tested N = 5 because some messages were exactly this long, and there are hardly any shorter messages. Moreover, while N = 5 or N = 6 can lead to many false positives, these hits can be refined with the longer matches required by N = 10, which in turn is quite close to the actual average message length.
1) Frequency results: We show the overall N-gram occurrence statistics for N = 5, 6 and 10 in the INIT, ADD and FINAL sub-collections in Table III. In the third column, we list the number of unique N-grams with 2 or more occurrences for a given size in each sub-collection. As can be expected, these numbers increase with the number of messages in each sub-collection.
Table III: N-gram occurrence statistics for different sizes in the studied sub-collections.

 N   sub         #uniq     sum    avg    std
 5   INIT          186     573   3.08   1.56
     ADD           484    1292   2.67   2.02
     FINAL   718 (+48)    2175   3.03   2.24
 6   INIT          140     420   3.00   1.37
     ADD           361     923   2.56   1.20
     FINAL   548 (+47)    1619   2.95   1.71
10   INIT           92     243   2.64   0.99
     ADD           192     489   2.55   1.33
     FINAL   354 (+70)     964   2.72   1.41
We can also notice that, typically, the number of unique N-grams in the FINAL sub-collection is bigger than the sum of unique N-grams in the INIT and ADD sub-collections. The exact number of new N-grams added to the FINAL collection is shown in parentheses. The difference in new unique N-grams between 5- and 6-grams is small and, as expected, there are fewer new 6-grams than 5-grams. However, the number of new unique 10-grams is considerably bigger than the previous ones, which may be considered counter-intuitive. Moreover, due to their length, 10-grams are much less likely to correspond to false-positive near-duplicates. In consequence, we have examined those 10-grams in FINAL occurring exactly once in a message in INIT and once in a message in ADD (thus, with an exact frequency of 2). We have found that 52% of them contain "N+" strings, representing short and/or telephone numbers in spam messages; in consequence, the matched messages belong to the same SMS spam campaign. It must be noted that SMS messages in the same spam campaign can use different short and/or telephone numbers. The remaining 10-grams with a frequency of 2 correspond to:
- Other spam messages (e.g. "u are subscribed to the best mobile content service in").
- Chain letter messages extracted from the NUS SMS Corpus (e.g. "the xmas story is peace the xmas msg is love").
- Actual duplicates contributed to the NUS SMS Corpus (e.g. "i have been late in paying rent for the past").
Regarding the rest of the figures in Table III, the fourth, fifth and sixth columns report the total and the average number of hits per N-gram, plus the standard deviation, for each N-gram size and sub-collection, respectively. Only N-grams occurring in two or more messages are reported, because only those can correspond to near-duplicates. For instance, there are 573 hits of the 186 unique 5-grams occurring in two or more messages in the INIT sub-collection, and each such 5-gram occurs on average in 3.08 ± 1.56 messages.
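The per-row figures of Table III (#uniq, sum, avg, std) can be derived from such a frequency mapping. A minimal sketch, assuming the input is a dict from N-gram to the number of messages it occurs in, and using the population standard deviation (the paper does not state which variant it reports):

```python
from statistics import mean, pstdev

def occurrence_stats(counts):
    # keep only N-grams occurring in two or more messages
    freqs = [f for f in counts.values() if f >= 2]
    return {"uniq": len(freqs),
            "sum": sum(freqs),
            "avg": round(mean(freqs), 2),
            "std": round(pstdev(freqs), 2)}  # population std dev (assumption)

print(occurrence_stats({"gram a": 3, "gram b": 2, "gram c": 1}))
# -> {'uniq': 2, 'sum': 5, 'avg': 2.5, 'std': 0.5}
```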
As can be expected, the longer the N-grams, the lower the total number and average of matching messages, because the probability of getting a longer match between two randomly chosen messages is smaller. In general, the figures for INIT messages are bigger than for ADD, which makes sense because the proportion of spam in the first collection is three times that in the second, and most N-gram matches correspond to SMS spam messages. This also explains why the average number of matches in the FINAL sub-collection is closer to the INIT average than to the ADD average, as the total counts of spam messages are 322 and 425 for these sub-collections, respectively. As previously discussed, most matches come from spam messages, which account for the near-duplicates because of the intrinsic similarity between spam campaign patterns, and the ADD spam messages add to campaigns and patterns already present in the INIT sub-collection. In other words, for any of the sub-collections, the spam messages are typically more similar among themselves than the ham messages.
2) Top scoring N-grams: In order to compare the actual matches between messages in the studied sub-collections, we report the most frequent N-grams and their frequencies for each N in the following tables. We show the ten most frequent 5-grams and 6-grams in Tables IV and V, respectively.
First of all, it must be noted that, given an N-gram with counts 𝑖, 𝑗 and 𝑘 in the INIT, ADD and FINAL collections respectively, we must not expect that 𝑖 + 𝑗 = 𝑘. This is because some counts are missing: a previous N-gram match between two messages may already have been reported, and only N-gram matches corresponding to the left-most match between two messages are summed up.
Table IV: Ten top 5-grams and their frequencies in the studied sub-collections.

INIT                                      #f
we are trying to contact                  10
this is the Nnd attempt                    9
urgent we are trying to                    9
prize guaranteed call NNNNNNNNNNN from     8
bonus caller prize on NN                   7
draw txt music to NNNNN                    7
prize N claim is easy                      7
you have won a guaranteed                  7
a N NNN bonus caller                       6
are selected to receive a                  6

ADD                                       #f
sorry i ll call later                     37
private your NNNN account statement       15
i cant pick the phone                     12
hope you are having a                      9
text me when you re                        9
£ NNNN cash or a                           8
NNN anytime any network mins               8
a £ NNNN prize guaranteed                  7
have a secret admirer who                  7
u have a secret admirer                    7

FINAL                                     #f
sorry i ll call later                     37
private your NNNN account statement       16
we are trying to contact                  14
prize guaranteed call NNNNNNNNNNN from    13
you have won a guaranteed                 13
a NNNN prize guaranteed call              12
draw shows that you have                  12
i cant pick the phone                     12
urgent we are trying to                   11
call NNNNNNNNNNN from land line           10

As can be seen regarding 5-grams:
- 5-grams already present in the INIT and ADD sub-collections do not collapse to greatly increase their frequency. For instance, the 5-grams "sorry i ll call later" and "i cant pick the phone" do not change their frequency from ADD to FINAL. These 5-grams correspond to templates often present in cell phones and used in legitimate messages. Actually, both are complete messages themselves.
- The behavior of the remaining 5-grams, which nearly all occur only in spam messages, is a bit different. Most of them are fuzzy duplicates that result in small frequency increases, as in "we are trying to contact" from INIT (10 messages) to FINAL (14 messages). This might suggest that the messages in ADD are duplicates of the messages in INIT. However, as can be seen, the patterns of spam 5-grams within each sub-collection are very regular and even overlapping, so this is not significant. In other words, these 4 messages are not repeated, but new instances of spam probably sent by the same organization. Other messages simply disappear from the top, as they keep their frequencies.
Regarding 6-grams, shown in Table V (N = 6 being the standard value used in tools like WCopyfind), we can see that the behavior is quite similar to the case of 5-grams. The results are slightly different for two reasons:
- Longer N-grams must obviously lead to lower frequencies. Actually, there is no significant drop in the number of matches per 6-gram, as can be seen e.g. in "private your NNNN account statement for", which includes the 5-gram "private your NNNN account statement" as a prefix.
- The most frequent 6-grams still belong to spam messages. The 5-grams that frequently occurred in legitimate messages have disappeared because the detected templates are, in fact, complete 5-token messages.
In the 6-gram results, we can see again that there are no significant near-duplicates except for those already present in each sub-collection. Moreover, the results for 10-grams (not presented here due to space limits) are very similar to the previous ones. In consequence, we believe it is safe to say that merging the sub-collections, although they have roughly the same sources, does not lead to near-duplicates that may ease the task of detecting SMS spam.
IV. CONCLUSIONS AND FUTURE WORK
In this paper, we have performed a careful analysis of the new SMS Spam Collection, which has been built in order to promote experimentation with machine learning SMS spam classifiers. This collection has been developed by enriching a previously existing SMS corpus, using the same data sources. As a consequence, the added messages may contain messages already present in the original collection. Thus, it is necessary to ensure that this does not happen, as it may ease the task of learning SMS spam classifiers.
We have performed a detailed analysis of potential near-duplicates in the collection, using a standard "String-of-Text" method, on three sub-collections: the original one (INIT), the added messages (ADD), and the final collection (FINAL). The near-duplicate detection method consists of finding N-gram matches between messages, for N = 5, 6 and 10, within each collection, in order to verify that there is no significant number of near-duplicates in the FINAL sub-collection apart from those already present in the INIT and ADD sub-collections.
We have found that 5-grams already present in the INIT and ADD sub-collections do not collapse to greatly increase their frequencies, and that they typically correspond to templates often present in cell phones and used in legitimate messages (e.g. "sorry i ll call later"). The 5-grams that co-occur in INIT and ADD, and thus get their frequencies increased in FINAL, are new instances of spam most likely sent by the same organization. In the 6-gram results, we have found that there are no significant near-duplicates except for those already present in each sub-collection. Moreover, the results achieved with 10-grams are very similar to the 5- and 6-gram ones.
Table V: Ten top 6-grams and their frequencies in the studied sub-collections.

INIT                                            #f
this is the Nnd attempt to                       9
urgent we are trying to contact                  9
prize guaranteed call NNNNNNNNNNN from land      7
a N NNN bonus caller prize                       6
bonus caller prize on NN NN                      6
cash await collection sae t cs                   6
tone N ur mob every week                         6
you have won a guaranteed NNNN                   6
a NNNN prize guaranteed call NNNNNNNNNNN         5
call NNNNNNNNNNN now only NNp per                5

ADD                                             #f
private your NNNN account statement for         15
i cant pick the phone right                     12
a £ NNNN prize guaranteed call                   7
have a secret admirer who is                     7
i am on the way to                               6
pls convey my birthday wishes to                 6
u have a secret admirer who                      6
£ NNN cash every wk txt                          5
as i entered my cabin my                         5
goodmorning today i am late for                  5

FINAL                                           #f
private your NNNN account statement for         16
a NNNN prize guaranteed call NNNNNNNNNNN        12
draw shows that you have won                    12
i cant pick the phone right                     12
prize guaranteed call NNNNNNNNNNN from land     12
urgent we are trying to contact                 11
call our customer service representative on     10
this is the Nnd attempt to                       9
tone N ur mob every week                         9
we are trying to contact u                       9
In consequence, we believe it is safe to say that merging
the sub-collections, although they have roughly the same
sources, does not lead to near-duplicates that may ease the
task of detecting SMS spam.
As future work, we plan to perform thorough experiments with machine learning content-based classifiers in order to confirm and improve previous work by us and others ([9], [10], and [11]) on the much smaller SMS Spam Corpus.
ACKNOWLEDGMENT
The authors would like to thank the Brazilian agencies FAPESP, Capes and CNPq for their financial support.
REFERENCES
[1] G. Cormack, "Email Spam Filtering: A Systematic Review," Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335–455, 2008.
[2] T. A. Almeida, A. Yamakami, and J. Almeida, "Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters," in Proceedings of the 8th IEEE International Conference on Machine Learning and Applications, Miami, FL, USA, 2009, pp. 517–522.
[3] ——, "Filtering Spams using the Minimum Description Length Principle," in Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, 2010, pp. 1856–1860.
[4] ——, "Probabilistic Anti-Spam Filtering with Dimensionality Reduction," in Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, 2010, pp. 1804–1808.
[5] T. A. Almeida and A. Yamakami, "Content-Based Spam Filtering," in Proceedings of the 23rd IEEE International Joint Conference on Neural Networks, Barcelona, Spain, 2010, pp. 1–7.
[6] T. A. Almeida, J. Almeida, and A. Yamakami, "Spam Filtering: How the Dimensionality Reduction Affects the Accuracy of Naive Bayes Classifiers," Journal of Internet Services and Applications, vol. 1, no. 3, pp. 183–200, 2011.
[7] T. A. Almeida and A. Yamakami, "Facing the Spammers: A Very Effective Approach to Avoid Junk E-mails," Expert Systems with Applications, vol. 39, pp. 6557–6561, 2012.
[8] T. Almeida, J. M. Gómez Hidalgo, and A. Yamakami, "Contributions to the Study of SMS Spam Filtering: New Collection and Results," in Proceedings of the 2011 ACM Symposium on Document Engineering, Mountain View, CA, USA, 2011, pp. 259–262.
[9] G. V. Cormack, J. M. Gómez Hidalgo, and E. Puertas Sanz, "Feature Engineering for Mobile (SMS) Spam Filtering," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2007, pp. 871–872.
[10] ——, "Spam Filtering for Short Messages," in Proceedings of the 16th ACM Conference on Information and Knowledge Management, Lisbon, Portugal, 2007, pp. 313–320.
[11] J. M. Gómez Hidalgo, G. Cajigas Bringas, E. Puertas Sanz, and F. Carrero García, "Content Based SMS Spam Filtering," in Proceedings of the 2006 ACM Symposium on Document Engineering, Amsterdam, The Netherlands, 2006, pp. 107–114.
[12] J. P. Kumar and P. Govindarajulu, "Duplicate and near duplicate documents detection: A review," European Journal of Scientific Research, vol. 32, pp. 514–527, 2009.
[13] A. M. El Tahir Ali, H. M. Dahwa Abdulla, and V. Snasel, "Survey of Plagiarism Detection Methods," in Proceedings of the 5th Asia Modelling Symposium, Manila, Philippines, 2011, pp. 39–42.
Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
... This dataset is followed by the NSL-KDD dataset which is only a smaller version without the redundant and noisy records present in KDD'99. Additionally, none of these datasets are balanced, therefore suitable evaluation [114][115][116] in 2011. It is a labelled dataset of 5574 SMS messages, 747 spam and 4827 ham, collected from mobile phones. ...
Article
Full-text available
In Machine Learning, the datasets used to build models are one of the main factors limiting what these models can achieve and how good their predictive performance is. Machine Learning applications for cyber-security or computer security are numerous including cyber threat mitigation and security infrastructure enhancement through pattern recognition, real-time attack detection, and in-depth penetration testing. Therefore, for these applications in particular, the datasets used to build the models must be carefully thought to be representative of real-world data. However, because of the scarcity of labelled data and the cost of manually labelling positive examples, there is a growing corpus of literature utilizing Semi-Supervised Learning with cyber-security data repositories. In this work, we provide a comprehensive overview of publicly available data repositories and datasets used for building computer security or cyber-security systems based on Semi-Supervised Learning, where only a few labels are necessary or available for building strong models. We highlight the strengths and limitations of the data repositories and sets and provide an analysis of the performance assessment metrics used to evaluate the built models. Finally, we discuss open challenges and provide future research directions for using cyber-security datasets and evaluating models built upon them.
... The prediction is made by recording the patient data. [52]). ...
Article
Full-text available
The random forest algorithm could be enhanced and produce better results with a well-designed and organized feature selection phase. The dependency structure between the variables is considered to be the most important criterion behind selecting the variables to be used in the algorithm during the feature selection phase. As the dependency structure is mostly nonlinear, making use of a tool that considers nonlinearity would be a more beneficial approach. Copula-Based Clustering technique (CoClust) clusters variables with copulas according to nonlinear dependency. We show that it is possible to achieve a remarkable improvement in CPU times and accuracy by adding the CoClust-based feature selection step to the random forest technique. We work with two different large datasets, namely, the MIMIC-III Sepsis Dataset and the SMS Spam Collection Dataset. The first dataset is large in terms of rows referring to individual IDs, while the latter is an example of longer column length data with many variables to be considered. In the proposed approach, first, random forest is employed without adding the CoClust step. Then, random forest is repeated in the clusters obtained with CoClust. The obtained results are compared in terms of CPU time, accuracy and ROC (receiver operating characteristic) curve. CoClust clustering results are compared with K-means and hierarchical clustering techniques. The Random Forest, Gradient Boosting and Logistic Regression results obtained with these clusters and the success of RF and CoClust working together are examined.
... 6. Spam. SMS Spam Collection v.1 [53] is a dataset containing SMS contents labeled as spam or not. Here, ChatGPT had to classify an input text accordingly. ...
... 6. Spam. SMS Spam Collection v.1 [69] is a dataset containing SMS contents labeled as spam or not. Here, ChatGPT had to classify an input text accordingly. ...
Preprint
Full-text available
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. The first contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated GPT-4 model on five selected subsets of NLP tasks. We automated ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25\% for zero-shot and few-shot evaluation. For GPT-4 model, a loss for semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. It especially refers to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. 
Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
Conference Paper
Full-text available
One of the most accessible ways to communicate via text is through a short message service. In recent years, profit-seeking people have taken advantage of the good features of this service to send large numbers of spam messages to random people for malicious purposes. In this respect, detecting spam messages is an important task. The unbalanced proportion of the spam and ham data and the extraction of efficient features from short messages have been the main challenges in the SMS spam detection problem. So far, various methods have been proposed to filter spam messages, whose accuracy still needs to be improved. In this study, we propose an ensemble learning method based on random forest and logistic regression algorithms to increase the accuracy of SMS spam detection. The proposed approach has been tested on two real datasets. The experimental evaluation based on accuracy and AUC shows the effectiveness of the proposed ensemble learning algorithm.
Article
Full-text available
This paper analysis the paper the method of intelligent spam filtering techniques during the SMS (Short Message Services) takes paradigm, in the context of mobile text messages spam. The unique characteristics of the SMS contents be indicative of the fact that all approaches cannot be equally effective or efficient. This paper compares some of the trendy mobile SMS spam filtering techniques on a publicly available SMS spam corpus, to categorize the methods that work best in the SMS text context. This can give hints on optimized SMS spam detection for mobile text messages.
Article
Spam SMS be unwanted messages to users, which be worrying and from time to time damaging. present be a group of survey papers available on SMS spam detection techniques. study and reviewed their used techniques, approaches and algorithms, their advantages and disadvantages, evaluation measures, discussion on datasets as well as lastly end result judgment of the studies. even though, the SMS spam detection techniques are additional demanding than SMS spam detection techniques since of the local contents, use of shortened words, unluckily not any of the existing research addresses these challenges. There is a enormous scope of upcoming research in this region and this survey can act as a reference point for the upcoming direction of research.
Data
Full-text available
The growth of mobile phone users has lead to a dramatic increasing of SMS spam messages. In practice, fighting mobile phone spam is difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, that are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that Support Vector Machine outperforms other evaluated classifiers and, hence, it can be used as a good baseline for further comparison.
Article
Full-text available
Spam has become an increasingly important problem with a big economic impact in society. Spam filtering poses a special problem in text categorization, in which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle and confidence factors. The proposed model is fast to construct and incrementally updateable. Furthermore, we have conducted an empirical experiment using three well-known, large and public e-mail databases. The results indicate that the proposed classifier outperforms the state-of-the-art spam filters.
Conference Paper
Full-text available
A robust control strategy to stabilize a PVTOL aircraft in the presence of crosswind is proposed in this paper. The approach makes use of Robust Control Lyapunov Functions (RCLF) and Sontag's universal stabilizing feedback. A nonlinear dynamic model of the aircraft taking account the crosswind has been developed. Likewise, a robust nonlinear control strategy is proposed to stabilize the PVTOL aircraft using RCLF, and we have employed the Riccati equation's parameters to compute and tune it in real-time. To validate the proposed control strategy, various simulations have been carried out. The controller has been also applied in real-time to a PVTOL prototype undergoing crosswinds. The experimental results show the good performance of the control algorithm.
Article
Full-text available
This note presents an explicit proof of the theorem - due to Artstein - which states that the existence of a smooth control-Lyapunov function implies smooth stabilizability. Moreover, the result is extended to the real-analytic and rational cases as well. The proof uses a ‘universal’ formula given by an algebraic function of Lie derivatives; this formula originates in the solution of a simple Riccati equation.
Conference Paper
Full-text available
The growth of mobile phone users has lead to a dramatic increasing of SMS spam messages. In practice, fighting mobile phone spam is difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, that are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that Support Vector Machine outperforms other evaluated classifiers and, hence, it can be used as a good baseline for further comparison.
Article
Airplane flight mechanics is the application of Newton's laws to the study of airplane trajectories (performance), stability, and aerodynamic control. This text is limited to flight in a vertical plane and is divided into two parts. The first part, trajectory analysis, is concerned primarily with the derivation of analytical solutions of trajectory problems associated with the sizing of commercial jets, that is, take-off, climb, cruise, descent, and landing, including trajectory optimization. The second part, stability and control, is further classified as static or dynamic. On each iteration of airplane sizing, the center of gravity is placed so that the airplane is statically stable. Dynamic stability and control is included to study the response of an airplane to control and gust inputs, which is needed for the design of automatic flight control systems. Algorithms are presented for estimating lift, drag, pitching moment, and stability derivatives. Flight mechanics is a discipline. As such, it has equations of motion, acceptable approximations, and solution techniques for the approximate equations of motion. Once an analytical solution has been obtained, numbers are calculated in order to compare the answer with the assumptions used to derive it and to acquaint students with the sizes of the numbers. A subsonic business jet is used for these calculations.
Book
List of Tables. List of Examples. Preface. 1 The Kinematics and Dynamics of Aircraft Motion. 1.1 Introduction. 1.2 Vector Kinematics. 1.3 Matrix Analysis of Kinematics. 1.4 Geodesy, Earth's Gravitation, Terrestrial Navigation. 1.5 Rigid-Body Dynamics. 1.6 Summary. 2 Modeling the Aircraft. 2.1 Introduction. 2.2 Basic Aerodynamics. 2.3 Aircraft Forces and Moments. 2.4 Static Analysis. 2.5 The Nonlinear Aircraft Model. 2.6 Linear Models and the Stability Derivatives. 2.7 Summary. 3 Modeling, Design, and Simulation Tools. 3.1 Introduction. 3.2 State-Space Models. 3.3 Transfer Function Models. 3.4 Numerical Solution of the State Equations. 3.5 Aircraft Models for Simulation. 3.6 Steady-State Flight. 3.7 Numerical Linearization. 3.8 Aircraft Dynamic Behavior. 3.9 Feedback Control. 3.10 Summary. 4 Aircraft Dynamics and Classical Control Design. 4.1 Introduction. 4.2 Aircraft Rigid-Body Modes. 4.3 The Handling-Qualities Requirements. 4.4 Stability Augmentation. 4.5 Control Augmentation Systems. 4.6 Autopilots. 4.7 Nonlinear Simulation. 4.8 Summary. 5 Modern Design Techniques. 5.1 Introduction. 5.2 Assignment of Closed-Loop Dynamics. 5.3 Linear Quadratic Regulator with Output Feedback. 5.4 Tracking a Command. 5.5 Modifying the Performance Index. 5.6 Model-Following Design. 5.7 Linear Quadratic Design with Full State Feedback. 5.8 Dynamic Inversion Design. 5.9 Summary. 6 Robustness and Multivariable Frequency-Domain Techniques. 6.1 Introduction. 6.2 Multivariable Frequency-Domain Analysis. 6.3 Robust Output-Feedback Design. 6.4 Observers and the Kalman Filter. 6.5 LQG/Loop-Transfer Recovery. 6.6 Summary. 7 Digital Control. 7.1 Introduction. 7.2 Simulation of Digital Controllers. 7.3 Discretization of Continuous Controllers. 7.4 Modified Continuous Design. 7.5 Implementation Considerations. 7.6 Summary. Appendix A F-16 Model. Appendix B Software. Index.
Article
The development of Internet has resulted in the flooding of numerous copies of web documents in the search results making them futilely relevant to the users thereby creating a serious problem for internet search engines. The outcome of perpetual growth of Web and e-commerce has led to the increase in demand of new Web sites and Web applications. Duplicated web pages that consist of identical structure but different data can be regarded as clones. The identification of similar or near-duplicate pairs in a large collection is a significant problem with wide-spread applications. The problem has been deliberated for diverse data types (e.g. textual documents, spatial points and relational records) in diverse settings. Another contemporary materialization of the problem is the efficient identification of near-duplicate Web pages. This is certainly challenging in the web-scale due to the voluminous data and high dimensionalities of the documents. This survey paper has a fundamental intention to present an up-to-date review of the existing literature in duplicate and near duplicate detection of general documents and web documents in web crawling. Besides, the classification of the existing literature in duplicate and near duplicate detection techniques and a detailed description of the same are presented so as to make the survey more comprehensible. Additionally a brief introduction of web mining, web crawling, and duplicate document detection are also presented.
Conference Paper
Plagiarism has become one area of interest for researchers due to its importance, and its fast growing rates. In this paper we are going to survey and list the advantage sand disadvantages of the latest and the important effective methods used or developed in automatic plagiarism detection, according to their result. Mainly methods used in natural language text detection, index structure, and external plagiarism detection and clustering - based detection.