On the Validity of a New SMS Spam Collection
José María Gómez Hidalgo
R&D Department
Optenet
Las Rozas, Madrid, Spain
jgomez@optenet.com
Tiago A. Almeida
Department of Computer Science
Federal University of São Carlos UFSCar
Sorocaba, São Paulo, Brazil
talmeida@ufscar.br
Akebo Yamakami
School of Electrical and Computer Engineering
University of Campinas UNICAMP
Campinas, São Paulo, Brazil
akebo@dt.fee.unicamp.br
Abstract—Mobile phones are becoming the latest target of electronic junk mail. Recent reports clearly indicate that the volume of SMS spam messages is increasing dramatically year by year. One of the major concerns in academic settings has been the scarcity of public SMS spam datasets, which are sorely needed for the validation and comparison of different classifiers. To address this issue, we have recently proposed a new SMS Spam Collection that is, to the best of our knowledge, the largest public and real SMS dataset available for academic studies. However, as it was created by augmenting a previously existing database built from roughly the same sources, it is sensible to verify that no duplicates were introduced in the process. In this paper we therefore offer a comprehensive analysis of the new SMS Spam Collection to ensure that this is not the case, since duplicates may ease the task of learning SMS spam classifiers and hence compromise the evaluation of methods. The analysis indicates that the procedure followed does not lead to near-duplicates and, consequently, that the proposed dataset is reliable for evaluating and comparing the performance achieved by different classifiers.
Keywords-Spam filtering; Mobile spam; Text categorization;
Classification; Text analysis.
I. INTRODUCTION
Text messaging is a communication service component of
phone, web or mobile communication systems, using stan-
dardized communications protocols that allow the exchange
of short text messages between fixed line or mobile phone
devices. While the original term was derived from referring
to messages sent using the Short Message Service (SMS),
it has since been extended to include messages containing
image, video, and audio.
Mobile text messages are commonly used between mobile
phone users, as a substitute for voice calls in situations
where voice communication is impossible or undesirable.
Such way of communication is also very popular because
in some places text messages are significantly cheaper than
placing a phone call to another mobile phone.
Messaging still dominates mobile market non-voice revenues worldwide. According to a report recently provided by Portio Research¹, the worldwide mobile messaging market was worth USD 179.2 billion in 2010, passed USD 200 billion in 2011, and will probably reach USD 300 billion in 2014. The same study indicates that annual worldwide SMS traffic volumes rose to over 6.9 trillion by end-2010 and were expected to break 8 trillion by end-2011.

¹ http://www.portioresearch.com/MMF11-15.html
Mobile messages can be used to interact with automated
systems such as ordering products and services for mobile
phones or participating in contests. Service providers and
advertisers use direct text marketing to notify mobile phone
users about promotions, payment due dates and other noti-
fications that can usually be sent by post or e-mail.
The downside is that cell phones are becoming the latest
target of electronic junk mail, with a growing number of
marketers using text messages to target subscribers. SMS
spam (sometimes also called mobile phone spam) is any
unwanted or unsolicited text message received on a mobile
device. Although this practice is rare in North America, it
has been very common in some parts of Asia.
SMS text messaging offers a target rich environment for
spammers. With the explosive growth in text messaging
along with unlimited texting plans it barely costs anything
for the attackers to send malicious messages. This combined
with the trust users inherently have in their mobile devices
makes it an environment rife for attack. In fact, recent research by the Cloudmark company² reveals that financial fraud and spam via text messages are now growing at a rate of over 300 percent year over year.
In the same way that carriers face a real challenge in dealing with SMS spam, academic researchers in this field are also experiencing difficulties. Probably the major concern is the lack of large, real and public databases. Unlike the large number of available email spam datasets [1], [2], [3], [4], [5], [6], [7], there are very few corpora with real examples of mobile phone spam and, to make matters worse, they are usually small.
To fill this important gap, we have recently proposed the new SMS Spam Collection [8], which is, as far as we know, the largest real, public and non-encoded SMS spam corpus. However, it has been created by augmenting a previously existing database built from roughly the same sources. Thus, it is very important to verify that no duplicates were carried over from the other databases, since the added messages may contain messages already present in the original collection. In this paper, we therefore perform a detailed analysis of the new SMS Spam Collection in order to ensure that this does not happen, as it may ease the task of learning SMS spam classifiers.
² http://blog.cloudmark.com/2011/12/05/surge-in-financial-related-mobile-spam-in-q4
This paper is organized as follows: Section II presents the new SMS Spam Collection. A comprehensive near-duplicate analysis of the new SMS Spam Collection and its main results are presented in Section III. Finally, in Section IV, we offer conclusions and outline future work.
II. THE SMS SPAM COLLECTION
Reliable data are essential in any scientific research. It is common sense that the absence of representative data can seriously impact the evaluation and comparison of methods and, unfortunately, newer areas of study are generally affected by the lack of publicly available data.
To address the lack of SMS spam datasets, in [8] we proposed a new real, public and non-encoded SMS Spam Collection³ that is, as far as we know, the largest one available. Moreover, we offered a comprehensive performance evaluation comparing several established machine learning methods in order to provide good baseline results for further comparison.
As pointed out in [8], to create the SMS Spam Collection we collected data from different sources. First, a set of 425 SMS spam messages was manually extracted from the Grumbletext Web site⁴. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages.
We also added legitimate samples by inserting 450 SMS messages collected from Caroline Tagg's PhD thesis⁵. Furthermore, we selected 3,375 SMS ham messages randomly chosen from the NUS SMS Corpus⁶.
Finally, we incorporated the SMS Spam Corpus v.0.1 Big⁷, which is composed of 1,002 SMS ham messages and 322 spam messages. More details about this dataset can be found in [9], [10], and [11]. However, it is important to point out that the sources used for building this corpus are almost the same as those used to create the new SMS Spam Collection.
Despite the importance of the new collection in a scenario that badly needs such data, the SMS Spam Collection was created with messages from a previously existing database built using roughly the same sources. Therefore, at this stage, it is very important to perform a careful analysis of the validity of the proposed dataset, checking whether there are duplicates coming from both databases.
III. DUPLICATE ANALYSIS OF THE SMS SPAM COLLECTION
To ensure that the way the SMS Spam Collection has been built, by reusing the same message sources, does not lead to
³ The SMS Spam Collection is available at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection
⁴ The Grumbletext Web site is available at http://www.grumbletext.co.uk/
⁵ Caroline Tagg's PhD thesis is available at http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf
⁶ The NUS SMS Corpus is available at http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/
⁷ The SMS Spam Corpus v.0.1 Big is publicly available at http://www.esp.uem.es/jmgomez/smsspamcorpus/
invalid SMS spam filtering results, we need to study the potential overlap between the sub-collections used to build it. The hypothesis is that the messages added to the original SMS collection, even though extracted from the same sources (the Grumbletext site, the NUS SMS Corpus), do not duplicate previously existing messages, except for duplicates already present in the original collection or in the message sources themselves. Under this hypothesis, if there are duplicates in the final collection, the only causes can be:
- Spammers do use templates when writing their spam messages.
- Legitimate users do make use of message templates existing in their mobile phones.
- Legitimate users do re-send chain letters (e.g. jokes, Christmas messages, etc.).
So, if the task of SMS spam filtering is eased because of
these duplicate messages, the reason for this is the actual
behavior of SMS messaging by spammers and legitimate
users, and not the way the collection used for testing was
built.
In consequence, we have built the three SMS sub-collections described below (original, added, and all messages), and we have studied the most frequent duplicates in each of them. The hypothesis is confirmed if:
1) The existing duplicates in the original sub-collection
keep the same frequency statistics in the final collec-
tion, and
2) the existing duplicates in the added messages keep the
same frequency statistics in the final collection as well.
In the next sections, we describe the three sub-collections
used in the study, along with the approach we have used to
detect message duplicates, or more properly, near-duplicates.
We detail the results of the analysis, which confirm our
hypothesis.
A. Text collections
In order to evaluate the potential overlap between the
datasets which were used to build the proposed SMS Spam
Collection, we have searched for near-duplicates within three
sub-collections:
- The previously existing SMS Spam Corpus v.0.1 Big (INIT).
- The SMS collection that includes the additional messages from Grumbletext, the NUS SMS Corpus, and Tagg's PhD thesis (ADD).
- The released SMS Spam Collection (FINAL).
The INIT dataset has a total of 1,324 text messages, of which 1,002 are ham and 322 are spam. The ADD sub-collection is composed of 3,825 legitimate messages and 425 mobile spam messages, for a total of 4,250 text messages. The percentages of ham and spam are shown in Table I.
It is worth noticing that the previously existing SMS Spam Corpus v.0.1 Big, which corresponds to the INIT sub-collection, poses a simpler problem to machine-learning content-based spam filters, as that collection is more balanced than the new SMS Spam Collection. On the other hand, the new collection is much bigger, and more data often implies better learning generalization.
Table I: How the sub-collections are composed.

Class         INIT               ADD
            Amount     Pct     Amount     Pct
Ham          1,002   75.68      3,825   90.00
Spam           322   24.32        425   10.00
Total        1,324  100.00      4,250  100.00
In Table II we present the main statistics related to the
tokens extracted from the INIT and ADD sub-collections.
Table II: Basic statistics related to the tokens extracted from the sub-collections.

                 INIT      ADD
Ham            12,192   51,419
Spam            7,682    9,861
Total          19,874   61,280
Avg per Msg     15.01    14.42
Avg in Ham      12.17    13.44
Avg in Spam     23.86    23.20
Note that, in both sub-collections, mobile phone spam messages are on average about ten tokens longer than legitimate messages. Also note that the average number of tokens per message is quite similar in both sub-collections.
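Statistics like those in Table II can be reproduced with a few lines of code. The sketch below is illustrative only: the input is a hypothetical list of (label, text) pairs, and whitespace tokenization is an assumption, since the paper does not specify its tokenizer.

```python
def token_stats(messages):
    # messages: list of (label, text) pairs; labels assumed to be "ham"/"spam"
    totals = {"ham": 0, "spam": 0}   # token counts per class
    counts = {"ham": 0, "spam": 0}   # message counts per class
    for label, text in messages:
        totals[label] += len(text.split())  # naive whitespace tokenization (assumption)
        counts[label] += 1
    return {c: totals[c] / counts[c] for c in totals if counts[c]}

sample = [("ham", "sorry i ll call later"),
          ("spam", "you have won a guaranteed prize call now")]
print(token_stats(sample))  # -> {'ham': 5.0, 'spam': 8.0}
```

Run over the full sub-collections, the same computation would yield the "Avg in Ham" and "Avg in Spam" rows of Table II.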
B. Near-duplicate detection approach
For the particular needs of this study, and given the short nature of SMS messages, the "String-of-Text" method can be considered a reasonable baseline for detecting near-duplicate messages in our collection. The "String-of-Text" method, implemented by the WCopyfind⁸ tool, involves scanning suspect texts for approximately matching character sequences. In order not to be fooled by small manual modifications, this approximate matching can involve transformations like case changing and separator variation (e.g. addressing those users who include extra white spaces between words).
The “String-of-Text” method is a simplified version of the
general N-gram matching detection method, widely used in
the literature [12], [13]. An N-gram is an ordered sequence
of tokens or words present in a text, in which N is the
number of tokens.
For this purpose, texts are compared searching for N-grams of relatively large sizes (e.g. N = 6), with additional parameters (length of match in number of characters, etc.). This approach is implemented in WCopyfind, but we have simplified it to N-gram matching after a text normalization involving:
- Replacing all token separators by white spaces.
- Lowercasing all characters.
- Replacing digits by the character 'N' (to preserve the structure of phone numbers).
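As a rough illustration, the three normalization steps above can be sketched as follows (a minimal sketch in Python; the exact set of token separators used by the authors is not specified, so the regular expression here is an assumption):

```python
import re

def normalize(text):
    text = re.sub(r"[^\w£]+", " ", text)  # replace token separators by white spaces
    text = text.lower()                   # lowercase all characters
    text = re.sub(r"\d", "N", text)       # replace digits by 'N' (keeps number structure)
    return text.strip()

print(normalize("Reply STOP to 84122 Customer Services 08450542832"))
# -> "reply stop to NNNNN customer services NNNNNNNNNNN"
```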
For instance, the 6-gram "stop to NNNNN customer services NNNNNNNNNNN" corresponds to a match between the following two messages within the ADD sub-collection:

    Thank you, winner notified by sms. Good Luck! No future marketing reply STOP to 84122 customer services 08450542832

and

    Your unique user ID is 1172. For removal send STOP to 87239 customer services 08708034412

⁸ See: http://plagiarism.phys.virginia.edu
As can be seen, the two messages are not near-duplicates; instead, they share a common pattern, the matching 6-gram, found in messages reported by users as SMS spam on the Grumbletext site. In particular, the messages correspond to two different SMS advertising campaigns to which the users had actually not subscribed.
In consequence, this near-duplicate approach, especially with relatively short N-grams, can lead to many false positives. As a result, the statistics collected during our analysis represent an upper bound on the potential near-duplicates that occur in the final collection. In our opinion, this is safer than finding a lower bound, because in this way no near-duplicates will be missed, and the conclusions of the study are sound.
In order to find matching N-grams and message near-duplicates within a given sub-collection, we have followed this procedure:
1) All messages within the sub-collection are taken as a
sorted list.
2) Each N-gram for a message is built from left to right.
3) A match or hit is registered when an N-gram present
in a message 𝑖 is found in a message 𝑗, with 𝑖<𝑗.
4) If a hit for messages 𝑖 and 𝑗 is registered, no other
matches between those messages are stored.
5) All N-grams occurring in two or more messages are
stored, along with the number of messages in which
they occur.
Thus, if a particular N-gram is present in messages 𝑖, 𝑗
and 𝑘 with 𝑖<𝑗<𝑘, only the hits for 𝑖 and 𝑗, and for 𝑗 and
𝑘 are counted. It must be noted that it is possible that there
is a match between messages 𝑖 and 𝑗, and another match
between 𝑗 and 𝑘, but not between 𝑖 and 𝑘 because both
previous matching N-grams are different (although they may
have some overlap). In consequence, the way we compare
SMS messages is not symmetric.
It is worth noting that it may be the case that two messages
have several N-grams in common. In fact, that would be the
case for full long duplicate messages. In this situation, only
the first left N-gram is reported, and then other co-occurring
N-grams may be missing counts for yet other messages.
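The matching procedure described above can be sketched in code (a simplified reading of steps 1 to 5, not the authors' actual implementation). Note how the `break` enforces that only the left-most shared N-gram is counted for each message pair, which is what makes the comparison non-symmetric:

```python
from collections import defaultdict

def ngrams(tokens, n):
    # all contiguous N-grams of a token list, left to right
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]

def find_hits(messages, n):
    # messages: sorted list of already-normalized, tokenized texts
    counts = defaultdict(int)  # N-gram -> number of message pairs it matched
    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            grams_j = set(ngrams(messages[j], n))
            for g in ngrams(messages[i], n):  # scan message i left to right
                if g in grams_j:
                    counts[g] += 1  # register one hit for the pair (i, j) ...
                    break           # ... and ignore further matches for this pair
    return dict(counts)

msgs = [m.split() for m in ["a b c d e f", "a b c d e g", "x b c d e f"]]
print(find_hits(msgs, 5))
# pair (0,1) hits on the left-most shared 5-gram ('a','b','c','d','e');
# pair (0,2) hits on ('b','c','d','e','f'); pair (1,2) shares no 5-gram
```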
C. Results and analysis
The goal of this process is to check whether merging the first two sub-collections adds many near-duplicates to the final database, in order to assess the overlap between both collections. Within each sub-collection, we have compared each pair of messages, stored all matches of size N (N-grams with N = 5, 6, and 10), and sorted the N-grams according to their frequency, examining in detail the top ten per N. According to the literature, N = 6 is a typical value for detecting near-duplicate paragraphs. We also tested N = 5 because some messages were exactly this long, and there are hardly any shorter messages. Moreover, while N = 5 or N = 6 can lead to many false positives, these hits can be refined with the longer matches required by N = 10, which in turn is quite close to the actual average message length.
1) Frequency results: We show the overall N-gram occurrence statistics for N = 5, 6 and 10 in the INIT, ADD and FINAL sub-collections in Table III. In the third column, we list the number of unique N-grams with 2 or more occurrences for a given size in each sub-collection. As can be expected, these numbers increase with the number of messages in each sub-collection.
Table III: N-gram occurrence statistics for different sizes in the studied sub-collections.

 N   sub         #uniq     sum    avg    std
 5   INIT          186     573   3.08   1.56
     ADD           484    1292   2.67   2.02
     FINAL   718 (+48)    2175   3.03   2.24
 6   INIT          140     420   3.00   1.37
     ADD           361     923   2.56   1.20
     FINAL   548 (+47)    1619   2.95   1.71
10   INIT           92     243   2.64   0.99
     ADD           192     489   2.55   1.33
     FINAL   354 (+70)     964   2.72   1.41
We can also notice that, typically, the number of unique N-grams in the FINAL sub-collection is bigger than the sum of unique N-grams in the INIT and ADD sub-collections. The exact number of new N-grams added to the FINAL collection is shown in parentheses. The difference in new unique N-grams between 5- and 6-grams is small and, as expected, there are fewer new 6-grams than 5-grams. However, the number of new unique 10-grams is considerably bigger than the previous ones, which may be considered counter-intuitive. Moreover, due to their length, 10-grams are much less likely to correspond to false-positive near-duplicates. In consequence, we have examined those 10-grams in FINAL occurring exactly once in a message in INIT and once in a message in ADD (thus, with an exact frequency of 2). We have found that 52% of them contain "N+" strings, representing short and/or telephone numbers in spam messages; in consequence, the matched messages belong to the same SMS spam campaign. It must be noted that SMS messages in the same spam campaign can use different short and/or telephone numbers. The remaining 10-grams with a frequency of 2 correspond to:
- Other spam messages (e.g. "u are subscribed to the best mobile content service in").
- Chain letter messages extracted from the NUS SMS Corpus (e.g. "the xmas story is peace the xmas msg is love").
- Actual duplicates contributed to the NUS SMS Corpus (e.g. "i have been late in paying rent for the past").
Regarding the rest of the figures in Table III, the fourth, fifth and sixth columns report the total and the average number of hits per N-gram, plus the standard deviation, for each N-gram size and sub-collection, respectively. Only N-grams occurring in two or more messages are reported, because only those can correspond to near-duplicates. For instance, there are 573 hits of the 186 unique 5-grams occurring in two or more messages in the INIT sub-collection, and each such 5-gram occurs on average in 3.08 ± 1.56 messages.
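The per-row figures of Table III (#uniq, sum, avg, std) can be derived from such a frequency mapping. A minimal sketch, assuming the input is a dict from N-gram to the number of messages it occurs in, and using the population standard deviation (the paper does not state which variant it reports):

```python
from statistics import mean, pstdev

def occurrence_stats(counts):
    # keep only N-grams occurring in two or more messages
    freqs = [f for f in counts.values() if f >= 2]
    return {"uniq": len(freqs),
            "sum": sum(freqs),
            "avg": round(mean(freqs), 2),
            "std": round(pstdev(freqs), 2)}  # population std dev (assumption)

print(occurrence_stats({"gram a": 3, "gram b": 2, "gram c": 1}))
# -> {'uniq': 2, 'sum': 5, 'avg': 2.5, 'std': 0.5}
```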
As can be expected, the longer the N-grams, the lower the total number and average of matching messages, because the probability of getting a longer match between two randomly chosen messages is smaller. In general, the figures for INIT messages are bigger than for ADD, which makes sense because the proportion of spam in the first collection is three times that in the second, and most N-gram matches correspond to SMS spam messages. This also explains why the average number of matches in the FINAL sub-collection is closer to the INIT average than to the ADD average, as the total counts of spam messages are 322 and 425 for these sub-collections, respectively. As previously discussed, most matches come from spam messages, which account for the near-duplicates because of the intrinsic similarity between spam campaign patterns, and the ADD spam messages add to campaigns and patterns already present in the INIT sub-collection. In other words, for any of the sub-collections, the spam messages are typically more similar among themselves than the ham messages.
2) Top scoring N-grams: In order to compare the actual matches between messages in the studied sub-collections, we report the most frequent N-grams and their frequencies for each N in the following tables. We show the ten most frequent 5-grams and 6-grams in Tables IV and V, respectively.
First of all, it must be noted that, given an N-gram with counts 𝑖, 𝑗 and 𝑘 in the INIT, ADD and FINAL collections respectively, we must not expect that 𝑖 + 𝑗 = 𝑘. This is because some counts are missing: a previous N-gram match between two messages may already have been reported, and only N-gram matches corresponding to the left-most match between two messages are summed up.
Table IV: Ten top 5-grams and their frequencies in the studied sub-collections.

INIT                                      #f
we are trying to contact                  10
this is the Nnd attempt                    9
urgent we are trying to                    9
prize guaranteed call NNNNNNNNNNN from     8
bonus caller prize on NN                   7
draw txt music to NNNNN                    7
prize N claim is easy                      7
you have won a guaranteed                  7
a N NNN bonus caller                       6
are selected to receive a                  6

ADD                                       #f
sorry i ll call later                     37
private your NNNN account statement       15
i cant pick the phone                     12
hope you are having a                      9
text me when you re                        9
£ NNNN cash or a                           8
NNN anytime any network mins               8
a £ NNNN prize guaranteed                  7
have a secret admirer who                  7
u have a secret admirer                    7

FINAL                                     #f
sorry i ll call later                     37
private your NNNN account statement       16
we are trying to contact                  14
prize guaranteed call NNNNNNNNNNN from    13
you have won a guaranteed                 13
a NNNN prize guaranteed call              12
draw shows that you have                  12
i cant pick the phone                     12
urgent we are trying to                   11
call NNNNNNNNNNN from land line           10

As can be seen regarding 5-grams:
- 5-grams already present in the INIT and ADD sub-collections do not collapse to greatly increase their frequency. For instance, the 5-grams "sorry i ll call later" and "i cant pick the phone" do not change their frequency from ADD to FINAL. These 5-grams correspond to templates often present in cell phones and used in legitimate messages. Actually, both are complete messages themselves.
- The behavior of the remaining 5-grams, which nearly all occur only in spam messages, is a bit different. Most of them are fuzzy duplicates that result in small frequency increases, as in "we are trying to contact" from INIT (10 messages) to FINAL (14 messages). This might suggest that the messages in ADD are duplicates of the messages in INIT. However, as can be seen, the patterns of spam 5-grams within each sub-collection are very regular and even overlapping, so this is not significant. In other words, these 4 messages are not repeated, but new instances of spam probably sent by the same organization. Other messages simply disappear from the top, as they keep their frequencies.
Regarding 6-grams, shown in Table V (N = 6 being the standard value used in tools like WCopyfind), we can see that the behavior is quite similar to the case of 5-grams. The results are slightly different for two reasons:
- Longer N-grams must obviously lead to lower frequencies. Actually, there is no significant drop in the number of matches per 6-gram, as can be seen e.g. in "private your NNNN account statement for", which includes the 5-gram "private your NNNN account statement" as a prefix.
- The most frequent 6-grams still belong to spam messages. The 5-grams that frequently occurred in legitimate messages have disappeared because the detected templates are, in fact, complete 5-token messages.
In the 6-gram results, we can see again that there are no significant near-duplicates except for those already present in each sub-collection. Moreover, the results for 10-grams (not presented here due to space limits) are very similar to the previous ones. In consequence, we believe it is safe to say that merging the sub-collections, although they have roughly the same sources, does not lead to near-duplicates that may ease the task of detecting SMS spam.
IV. CONCLUSIONS AND FUTURE WORK
In this paper, we have performed a careful analysis of the new SMS Spam Collection, which has been built in order to promote experimentation with machine learning SMS spam classifiers. This collection has been developed by enriching a previously existing SMS corpus, using the same data sources. As a consequence, the added messages may contain messages already present in the original collection. Thus, it is necessary to ensure that this does not happen, as it may ease the task of learning SMS spam classifiers.
We have performed a detailed analysis of potential near-duplicates in the collection, using a standard "String-of-Text" method, on three sub-collections: the original one (INIT), the added messages (ADD), and the final collection (FINAL). The near-duplicate detection method consists of finding N-gram matches between messages, for N = 5, 6 and 10, within each collection, in order to verify that there is no significant number of near-duplicates in the FINAL sub-collection apart from those already present in the INIT and ADD sub-collections.
We have found that 5-grams already present in the INIT and ADD sub-collections do not collapse to greatly increase their frequencies, and that they typically correspond to templates often present in cell phones and used in legitimate messages (e.g. "sorry i ll call later"). The 5-grams that co-occur in INIT and ADD, and thus get their frequencies increased in FINAL, are new instances of spam most likely sent by the same organization. In the 6-gram results, we have found that there are no significant near-duplicates except for those already present in each sub-collection. Moreover, the results achieved with 10-grams are very similar to the 5- and 6-gram ones.
Table V: Ten top 6-grams and their frequencies in the studied sub-collections.

INIT                                            #f
this is the Nnd attempt to                       9
urgent we are trying to contact                  9
prize guaranteed call NNNNNNNNNNN from land      7
a N NNN bonus caller prize                       6
bonus caller prize on NN NN                      6
cash await collection sae t cs                   6
tone N ur mob every week                         6
you have won a guaranteed NNNN                   6
a NNNN prize guaranteed call NNNNNNNNNNN         5
call NNNNNNNNNNN now only NNp per                5

ADD                                             #f
private your NNNN account statement for         15
i cant pick the phone right                     12
a £ NNNN prize guaranteed call                   7
have a secret admirer who is                     7
i am on the way to                               6
pls convey my birthday wishes to                 6
u have a secret admirer who                      6
£ NNN cash every wk txt                          5
as i entered my cabin my                         5
goodmorning today i am late for                  5

FINAL                                           #f
private your NNNN account statement for         16
a NNNN prize guaranteed call NNNNNNNNNNN        12
draw shows that you have won                    12
i cant pick the phone right                     12
prize guaranteed call NNNNNNNNNNN from land     12
urgent we are trying to contact                 11
call our customer service representative on     10
this is the Nnd attempt to                       9
tone N ur mob every week                         9
we are trying to contact u                       9
In consequence, we believe it is safe to say that merging
the sub-collections, although they have roughly the same
sources, does not lead to near-duplicates that may ease the
task of detecting SMS spam.
As future work, we plan to perform thorough experiments with machine learning content-based classifiers in order to confirm and improve previous work by us and others ([9], [10], and [11]) on the much smaller SMS Spam Corpus.
ACKNOWLEDGMENT
The authors would like to thank the Brazilian agencies FAPESP, Capes and CNPq for their financial support.
REFERENCES
[1] G. Cormack, "Email Spam Filtering: A Systematic Review," Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335–455, 2008.
[2] T. A. Almeida, A. Yamakami, and J. Almeida, "Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters," in Proceedings of the 8th IEEE International Conference on Machine Learning and Applications, Miami, FL, USA, 2009, pp. 517–522.
[3] ——, "Filtering Spams using the Minimum Description Length Principle," in Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, 2010, pp. 1856–1860.
[4] ——, "Probabilistic Anti-Spam Filtering with Dimensionality Reduction," in Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, 2010, pp. 1804–1808.
[5] T. A. Almeida and A. Yamakami, "Content-Based Spam Filtering," in Proceedings of the 23rd IEEE International Joint Conference on Neural Networks, Barcelona, Spain, 2010, pp. 1–7.
[6] T. A. Almeida, J. Almeida, and A. Yamakami, "Spam Filtering: How the Dimensionality Reduction Affects the Accuracy of Naive Bayes Classifiers," Journal of Internet Services and Applications, vol. 1, no. 3, pp. 183–200, 2011.
[7] T. A. Almeida and A. Yamakami, "Facing the Spammers: A Very Effective Approach to Avoid Junk E-mails," Expert Systems with Applications, vol. 39, pp. 6557–6561, 2012.
[8] T. Almeida, J. M. Gómez Hidalgo, and A. Yamakami, "Contributions to the Study of SMS Spam Filtering: New Collection and Results," in Proceedings of the 2011 ACM Symposium on Document Engineering, Mountain View, CA, USA, 2011, pp. 259–262.
[9] G. V. Cormack, J. M. Gómez Hidalgo, and E. Puertas Sanz, "Feature Engineering for Mobile (SMS) Spam Filtering," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2007, pp. 871–872.
[10] ——, "Spam Filtering for Short Messages," in Proceedings of the 16th ACM Conference on Information and Knowledge Management, Lisbon, Portugal, 2007, pp. 313–320.
[11] J. M. Gómez Hidalgo, G. Cajigas Bringas, E. Puertas Sanz, and F. Carrero García, "Content Based SMS Spam Filtering," in Proceedings of the 2006 ACM Symposium on Document Engineering, Amsterdam, The Netherlands, 2006, pp. 107–114.
[12] J. P. Kumar and P. Govindarajulu, "Duplicate and near duplicate documents detection: A review," European Journal of Scientific Research, vol. 32, pp. 514–527, 2009.
[13] A. M. El Tahir Ali, H. M. Dahwa Abdulla, and V. Snasel, "Survey of Plagiarism Detection Methods," in Proceedings of the 5th Asia Modelling Symposium, Manila, Philippines, 2011, pp. 39–42.
Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
... This dataset is followed by the NSL-KDD dataset which is only a smaller version without the redundant and noisy records present in KDD'99. Additionally, none of these datasets are balanced, therefore suitable evaluation [114][115][116] in 2011. It is a labelled dataset of 5574 SMS messages, 747 spam and 4827 ham, collected from mobile phones. ...
Article
Full-text available
In Machine Learning, the datasets used to build models are one of the main factors limiting what these models can achieve and how good their predictive performance is. Machine Learning applications for cyber-security or computer security are numerous including cyber threat mitigation and security infrastructure enhancement through pattern recognition, real-time attack detection, and in-depth penetration testing. Therefore, for these applications in particular, the datasets used to build the models must be carefully thought to be representative of real-world data. However, because of the scarcity of labelled data and the cost of manually labelling positive examples, there is a growing corpus of literature utilizing Semi-Supervised Learning with cyber-security data repositories. In this work, we provide a comprehensive overview of publicly available data repositories and datasets used for building computer security or cyber-security systems based on Semi-Supervised Learning, where only a few labels are necessary or available for building strong models. We highlight the strengths and limitations of the data repositories and sets and provide an analysis of the performance assessment metrics used to evaluate the built models. Finally, we discuss open challenges and provide future research directions for using cyber-security datasets and evaluating models built upon them.
... The prediction is made by recording the patient data. [52]). ...
Article
Full-text available
The random forest algorithm could be enhanced and produce better results with a well-designed and organized feature selection phase. The dependency structure between the variables is considered to be the most important criterion behind selecting the variables to be used in the algorithm during the feature selection phase. As the dependency structure is mostly nonlinear, making use of a tool that considers nonlinearity would be a more beneficial approach. Copula-Based Clustering technique (CoClust) clusters variables with copulas according to nonlinear dependency. We show that it is possible to achieve a remarkable improvement in CPU times and accuracy by adding the CoClust-based feature selection step to the random forest technique. We work with two different large datasets, namely, the MIMIC-III Sepsis Dataset and the SMS Spam Collection Dataset. The first dataset is large in terms of rows referring to individual IDs, while the latter is an example of longer column length data with many variables to be considered. In the proposed approach, first, random forest is employed without adding the CoClust step. Then, random forest is repeated in the clusters obtained with CoClust. The obtained results are compared in terms of CPU time, accuracy and ROC (receiver operating characteristic) curve. CoClust clustering results are compared with K-means and hierarchical clustering techniques. The Random Forest, Gradient Boosting and Logistic Regression results obtained with these clusters and the success of RF and CoClust working together are examined.
... 6. Spam. SMS Spam Collection v.1 [53] is a dataset containing SMS contents labeled as spam or not. Here, ChatGPT had to classify an input text accordingly. ...
... 6. Spam. SMS Spam Collection v.1 [69] is a dataset containing SMS contents labeled as spam or not. Here, ChatGPT had to classify an input text accordingly. ...
Preprint
Full-text available
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. The first contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated GPT-4 model on five selected subsets of NLP tasks. We automated ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25\% for zero-shot and few-shot evaluation. For GPT-4 model, a loss for semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. It especially refers to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. 
Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
Conference Paper
Full-text available
One of the most accessible ways to communicate via text is through a short message service. In recent years, profit-seeking people have taken advantage of the good features of this service to send large numbers of spam messages to random people for malicious purposes. In this respect, detecting spam messages is an important task. The unbalanced proportion of the spam and ham data and the extraction of efficient features from short messages have been the main challenges in the SMS spam detection problem. So far, various methods have been proposed to filter spam messages, whose accuracy still needs to be improved. In this study, we propose an ensemble learning method based on random forest and logistic regression algorithms to increase the accuracy of SMS spam detection. The proposed approach has been tested on two real datasets. The experimental evaluation based on accuracy and AUC shows the effectiveness of the proposed ensemble learning algorithm.
Article
Full-text available
This paper analysis the paper the method of intelligent spam filtering techniques during the SMS (Short Message Services) takes paradigm, in the context of mobile text messages spam. The unique characteristics of the SMS contents be indicative of the fact that all approaches cannot be equally effective or efficient. This paper compares some of the trendy mobile SMS spam filtering techniques on a publicly available SMS spam corpus, to categorize the methods that work best in the SMS text context. This can give hints on optimized SMS spam detection for mobile text messages.
Article
Spam SMS be unwanted messages to users, which be worrying and from time to time damaging. present be a group of survey papers available on SMS spam detection techniques. study and reviewed their used techniques, approaches and algorithms, their advantages and disadvantages, evaluation measures, discussion on datasets as well as lastly end result judgment of the studies. even though, the SMS spam detection techniques are additional demanding than SMS spam detection techniques since of the local contents, use of shortened words, unluckily not any of the existing research addresses these challenges. There is a enormous scope of upcoming research in this region and this survey can act as a reference point for the upcoming direction of research.
Data
Full-text available
The growth of mobile phone users has lead to a dramatic increasing of SMS spam messages. In practice, fighting mobile phone spam is difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, that are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that Support Vector Machine outperforms other evaluated classifiers and, hence, it can be used as a good baseline for further comparison.
Article
Full-text available
Spam has become an increasingly important problem with a big economic impact in society. Spam filtering poses a special problem in text categorization, in which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle and confidence factors. The proposed model is fast to construct and incrementally updateable. Furthermore, we have conducted an empirical experiment using three well-known, large and public e-mail databases. The results indicate that the proposed classifier outperforms the state-of-the-art spam filters.
Conference Paper
Full-text available
A robust control strategy to stabilize a PVTOL aircraft in the presence of crosswind is proposed in this paper. The approach makes use of Robust Control Lyapunov Functions (RCLF) and Sontag's universal stabilizing feedback. A nonlinear dynamic model of the aircraft taking account the crosswind has been developed. Likewise, a robust nonlinear control strategy is proposed to stabilize the PVTOL aircraft using RCLF, and we have employed the Riccati equation's parameters to compute and tune it in real-time. To validate the proposed control strategy, various simulations have been carried out. The controller has been also applied in real-time to a PVTOL prototype undergoing crosswinds. The experimental results show the good performance of the control algorithm.
Article
Full-text available
This note presents an explicit proof of the theorem - due to Artstein - which states that the existence of a smooth control-Lyapunov function implies smooth stabilizability. Moreover, the result is extended to the real-analytic and rational cases as well. The proof uses a ‘universal’ formula given by an algebraic function of Lie derivatives; this formula originates in the solution of a simple Riccati equation.
Conference Paper
Full-text available
The growth of mobile phone users has lead to a dramatic increasing of SMS spam messages. In practice, fighting mobile phone spam is difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, that are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that Support Vector Machine outperforms other evaluated classifiers and, hence, it can be used as a good baseline for further comparison.
Article
Airplane flight mechanics is the application of Newton's laws to the study of airplane trajectories (performance), stability, and aerodynamic control. This text is limited to flight in a vertical plane and is divided into two parts. The first part, trajectory analysis, is concerned primarily with the derivation of analytical solutions of trajectory problems associated with the sizing of commercial jets, that is, take-off, climb, cruise, descent, and landing, including trajectory optimization. The second part, stability and control, is further classified as static or dynamic. On each iteration of airplane sizing, the center of gravity is placed so that the airplane is statically stable. Dynamic stability and control is included to study the response of an airplane to control and gust inputs, which is needed for the design of automatic flight control systems. Algorithms are presented for estimating lift, drag, pitching moment, and stability derivatives. Flight mechanics is a discipline. As such, it has equations of motion, acceptable approximations, and solution techniques for the approximate equations of motion. Once an analytical solution has been obtained, numbers are calculated in order to compare the answer with the assumptions used to derive it and to acquaint students with the sizes of the numbers. A subsonic business jet is used for these calculations.
Book
List of Tables. List of Examples. Preface. 1 The Kinematics and Dynamics of Aircraft Motion. 1.1 Introduction. 1.2 Vector Kinematics. 1.3 Matrix Analysis of Kinematics. 1.4 Geodesy, Earth's Gravitation, Terrestrial Navigation. 1.5 Rigid-Body Dynamics. 1.6 Summary. 2 Modeling the Aircraft. 2.1 Introduction. 2.2 Basic Aerodynamics. 2.3 Aircraft Forces and Moments. 2.4 Static Analysis. 2.5 The Nonlinear Aircraft Model. 2.6 Linear Models and the Stability Derivatives. 2.7 Summary. 3 Modeling, Design, and Simulation Tools. 3.1 Introduction. 3.2 State-Space Models. 3.3 Transfer Function Models. 3.4 Numerical Solution of the State Equations. 3.5 Aircraft Models for Simulation. 3.6 Steady-State Flight. 3.7 Numerical Linearization. 3.8 Aircraft Dynamic Behavior. 3.9 Feedback Control. 3.10 Summary. 4 Aircraft Dynamics and Classical Control Design. 4.1 Introduction. 4.2 Aircraft Rigid-Body Modes. 4.3 The Handling-Qualities Requirements. 4.4 Stability Augmentation. 4.5 Control Augmentation Systems. 4.6 Autopilots. 4.7 Nonlinear Simulation. 4.8 Summary. 5 Modern Design Techniques. 5.1 Introduction. 5.2 Assignment of Closed-Loop Dynamics. 5.3 Linear Quadratic Regulator with Output Feedback. 5.4 Tracking a Command. 5.5 Modifying the Performance Index. 5.6 Model-Following Design. 5.7 Linear Quadratic Design with Full State Feedback. 5.8 Dynamic Inversion Design. 5.9 Summary. 6 Robustness and Multivariable Frequency-Domain Techniques. 6.1 Introduction. 6.2 Multivariable Frequency-Domain Analysis. 6.3 Robust Output-Feedback Design. 6.4 Observers and the Kalman Filter. 6.5 LQG/Loop-Transfer Recovery. 6.6 Summary. 7 Digital Control. 7.1 Introduction. 7.2 Simulation of Digital Controllers. 7.3 Discretization of Continuous Controllers. 7.4 Modified Continuous Design. 7.5 Implementation Considerations. 7.6 Summary. Appendix A F-16 Model. Appendix B Software. Index.
Article
The development of Internet has resulted in the flooding of numerous copies of web documents in the search results making them futilely relevant to the users thereby creating a serious problem for internet search engines. The outcome of perpetual growth of Web and e-commerce has led to the increase in demand of new Web sites and Web applications. Duplicated web pages that consist of identical structure but different data can be regarded as clones. The identification of similar or near-duplicate pairs in a large collection is a significant problem with wide-spread applications. The problem has been deliberated for diverse data types (e.g. textual documents, spatial points and relational records) in diverse settings. Another contemporary materialization of the problem is the efficient identification of near-duplicate Web pages. This is certainly challenging in the web-scale due to the voluminous data and high dimensionalities of the documents. This survey paper has a fundamental intention to present an up-to-date review of the existing literature in duplicate and near duplicate detection of general documents and web documents in web crawling. Besides, the classification of the existing literature in duplicate and near duplicate detection techniques and a detailed description of the same are presented so as to make the survey more comprehensible. Additionally a brief introduction of web mining, web crawling, and duplicate document detection are also presented.
Conference Paper
Plagiarism has become one area of interest for researchers due to its importance, and its fast growing rates. In this paper we are going to survey and list the advantage sand disadvantages of the latest and the important effective methods used or developed in automatic plagiarism detection, according to their result. Mainly methods used in natural language text detection, index structure, and external plagiarism detection and clustering - based detection.