Chapter

SMS Spam Filtering Using Machine Learning Technique

October 2020

Authors:

Arvind Vishwakarma

National Institute of Technology (NIT) Uttarakhand

Mohd Dilshad Ansari

SRM Institute of Science and Technology

Gaurav Rai

Savitribai Phule Pune University

Index-based Online Text Classification for SMS Spam Filtering

Article

Jun 2010

We proposed a novel index-based online text classification method, investigated two index models, and compared the performance of various index granularities for English and Chinese SMS messages. Based on the proposed method, six individual classifiers were implemented according to various text features of Chinese messages, and these were further combined to form an ensemble classifier. The experimental results from an English corpus show that relevant features among words can increase the classification confidence and that the trigram co-occurrence feature of words is an appropriate relevant feature. The experimental results from a real Chinese corpus show that the classifier applying the word-level index model performs better than the one applying the document-level index model. The trigram segment outperforms the exact segment in indexing, so it is not necessary to segment Chinese text exactly when indexing with our proposed method. Applying parallel multi-thread ensemble learning, our proposed method has constant time complexity, which is critical for large-scale data and online filtering.

Facing the spammers: A very effective approach to avoid junk e-mails

Article

Jun 2012
EXPERT SYST APPL

Spam has become an increasingly important problem with a big economic impact in society. Spam filtering poses a special problem in text categorization, in which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle and confidence factors. The proposed model is fast to construct and incrementally updateable. Furthermore, we have conducted an empirical experiment using three well-known, large and public e-mail databases. The results indicate that the proposed classifier outperforms the state-of-the-art spam filters.

Spam filtering for short messages

Conference Paper

Nov 2007

We consider the problem of content-based spam filtering for short text messages that arise in three contexts: mobile (SMS) communication, blog comments, and email summary information such as might be displayed by a low-bandwidth client. Short messages often consist of only a few words, and therefore present a challenge to traditional bag-of-words based spam filters. Using three corpora of short messages and message fields derived from real SMS, blog, and spam messages, we evaluate feature-based and compression-model-based spam filters. We observe that bag-of-words filters can be improved substantially using different features, while compression-model filters perform quite well as-is. We conclude that content filtering for short messages is surprisingly effective.

Contributions to the study of SMS spam filtering: new collection and results.

Conference Paper

Jan 2011

The growth in the number of mobile phone users has led to a dramatic increase in SMS spam messages. In practice, fighting mobile phone spam is made difficult by several factors, including the lower rate of SMS spam, which has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, which are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that the Support Vector Machine outperforms the other evaluated classifiers and, hence, can be used as a good baseline for further comparison.

Content based SMS spam filtering

Conference Paper

Oct 2006

In recent years, we have witnessed a dramatic increase in the volume of spam email. Other related forms of spam are increasingly emerging as a significant problem, especially spam on Instant Messaging services (the so-called SPIM) and Short Message Service (SMS) or mobile spam. Like email spam, the SMS spam problem can be approached with legal, economic or technical measures. Among the wide range of technical measures, Bayesian filters are playing a key role in stopping email spam. In this paper, we analyze to what extent Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam. In particular, we have built two SMS spam test collections of significant size, in English and Spanish. We have tested on them a number of message representation techniques and Machine Learning algorithms, in terms of effectiveness. Our results demonstrate that Bayesian filtering techniques can be effectively transferred from email to SMS spam.
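
To make the Bayesian filtering idea in the abstract above concrete, the following is a minimal sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is not the cited authors' implementation: the whitespace tokenizer, the function names `train` and `classify`, and the four toy messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real SMS filters
    # often use richer message representations.
    return text.lower().split()

def train(messages):
    """Build per-class word counts from (label, text) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for label, text in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        # Log prior estimated from class frequencies in the training data.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set, invented for this sketch.
toy_messages = [
    ("spam", "win a free prize now"),
    ("spam", "free entry win cash"),
    ("ham", "are we meeting for lunch"),
    ("ham", "call me when you are free"),
]
model = train(toy_messages)
print(classify("win free cash", model))
print(classify("are we meeting", model))
```

Working in log space keeps the product of many small word probabilities from underflowing, which is a common design choice for this family of filters.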

Feature engineering for mobile (SMS) spam filtering

Conference Paper

Jul 2007

Mobile spam is an increasing threat that may be addressed using filtering systems like those employed against email spam. We believe that email filtering techniques require some adaptation to reach good levels of performance on SMS spam, especially regarding message representation. To test this assumption, we have performed experiments on SMS filtering using top-performing email spam filters on mobile spam messages with a suitable feature representation, and the results support our hypothesis.

Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters

Conference Paper

Dec 2009

There are different approaches able to automatically detect e-mail spam messages, and the best-known ones are based on Bayesian decision theory. However, most of these approaches have the same difficulty: the high dimensionality of the feature space. Many term selection methods have been proposed in the literature. Nevertheless, it is still unclear how the performance of naive Bayes anti-spam filters depends on the methods applied for reducing the dimensionality of the feature space. In this paper, we compare the performance of the most popular term selection techniques for reducing the dimensionality of the term space, such as document frequency, information gain, mutual information, χ² statistic, and odds ratio, with four well-known versions of the naive Bayes spam filter.

Filtering spams using the minimum description length principle

Conference Paper

Mar 2010

Spam has become an increasingly important problem with a big economic impact in society. Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle. The proposed model is fast to construct and incrementally updateable. Additionally, we offer an analysis concerning the measurements usually employed to evaluate the quality of the anti-spam classifiers. In this sense, we present a new measure in order to provide a fairer comparison. Furthermore, we conducted an empirical experiment using six well-known, large and public databases. Finally, the results indicate that our approach outperforms the state-of-the-art spam filters.

Probabilistic anti-spam filtering with dimensionality reduction

Conference Paper

Mar 2010

One of the biggest problems of e-mail communication is the massive delivery of spam messages. Every day, billions of unwanted messages are sent by spammers, and this number does not stop growing. Fortunately, there are different approaches able to automatically detect and remove most of these messages, and the best-known ones are based on Bayesian decision theory. However, many machine learning techniques applied to text categorization have the same difficulty: the high dimensionality of the feature space. Many term selection methods have been proposed in the literature. Nevertheless, it is still unclear how the performance of naive Bayes anti-spam filters depends on the methods applied for reducing the dimensionality of the feature space. In this paper, we compare the performance of the most popular term selection techniques with some variations of the original naive Bayes anti-spam filter.

Evaluating cost-sensitive Unsolicited Bulk Email categorization

Conference Paper

Mar 2002

Jose Maria Gomez Hidalgo

In recent years, Unsolicited Bulk Email (UBE) has become an increasingly important problem, with a big economic impact. In this paper, we discuss cost-sensitive Text Categorization methods for UBE filtering. Specifically, we have evaluated a range of Machine Learning methods for the task (C4.5, Naive Bayes, PART, Support Vector Machines and Rocchio), made cost-sensitive through several methods (Threshold Optimization, Instance Weighting, and Meta-Cost). We have used the Receiver Operating Characteristic Convex Hull method for the evaluation, which best suits classification problems in which target conditions are not known, as is the case here. Our results do not show a dominant algorithm or a dominant method for making algorithms cost-sensitive, but they are the best reported on the test collection used, and they approach the accuracy of real-world hand-crafted classifiers.

Ensemble-based classifiers

Article

Feb 2010

Lior Rokach

The idea of ensemble methodology is to build a predictive model by integrating multiple models. It is well known that ensemble methods can be used for improving prediction performance. Researchers from various disciplines such as statistics and AI have considered the use of ensemble methodology. This paper reviews existing ensemble techniques and can serve as a tutorial for practitioners who are interested in building ensemble-based systems.

Spam filtering: How the dimensionality reduction affects the accuracy of Naive Bayes classifiers

Article

Feb 2011

E-mail spam has become an increasingly important problem with a big economic impact in society. Fortunately, there are different approaches allowing to automatically detect and remove most of those messages, and the best-known techniques are based on Bayesian decision theory. However, such probabilistic approaches often suffer from a well-known difficulty: the high dimensionality of the feature space. Many term-selection methods have been proposed for avoiding the curse of dimensionality. Nevertheless, it is still unclear how the performance of Naive Bayes spam filters depends on the scheme applied for reducing the dimensionality of the feature space. In this paper, we study the performance of many term-selection techniques with several different models of Naive Bayes spam filters. Our experiments were diligently designed to ensure statistically sound results. Moreover, we perform an analysis concerning the measurements usually employed to evaluate the quality of spam filters. Finally, we also investigate the benefits of using the Matthews correlation coefficient as a measure of performance.
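
As an illustration of the Matthews correlation coefficient mentioned in the abstract above, here is a minimal sketch that computes it from confusion-matrix counts using the standard definition. The function name and the example counts are assumptions made for this sketch, and returning 0.0 when the denominator is zero is a common convention rather than something stated in the cited article.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts (spam treated as the positive class)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention assumed here: undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical filter that labels every message as ham,
# evaluated on 90 ham and 10 spam messages.
tp, tn, fp, fn = 0, 90, 0, 10
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                           # 0.9
print(matthews_corrcoef(tp, tn, fp, fn))  # 0.0
```

The hypothetical all-ham filter reaches 90% accuracy on the imbalanced toy counts while its MCC is zero, which shows why a measure that uses all four confusion-matrix cells can be more informative than accuracy alone.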

Enhancement in Teaching Quality Methodology by Predicting Attendance Using Machine Learning Technique

Chapter

Feb 2020

An important task of a teacher is to make every student learn and pass the final examination. To this end, teachers make lesson plans for the year or semester according to the number of working days, with the goal of completing the syllabus before the final examination. The lesson plans are made without knowledge of the class attendance on any particular day, since it is hard for a teacher to guess it correctly. Therefore, when class strength is unexpectedly low on a given day, the teacher can either postpone the lecture to the next day or continue and let the absent students miss out. Postponing the lecture means the syllabus will not be completed on time, and letting students miss out is not a solution either. This paper discusses a solution to this problem using a Machine Learning model trained on past student attendance records to find patterns in class attendance and predict the class strength for any future date, so that lesson plans can be made or modified accordingly. Prior knowledge of class strength will help teachers act accordingly to achieve their goals.

Profit or Loss: A Long Short Term Memory based model for the Prediction of share price of DLF group in India

Conference Paper

Dec 2019

Presently, the prediction of share prices is a challenging issue for the research community, as the share market is a chaotic place. This is because several factors are involved, such as government policies, the international market, weather, and company performance. In this article, a model has been developed using long short term memory (LSTM) to predict the share price of DLF group. For the experiments, the data of DLF group has been taken from Yahoo financial services for the period 2008 to 2018, and the recurrent neural network (RNN) model has been trained using data ranging from 2008 to 2017. This RNN-based model has been tested on the data of the year 2018. For performance comparison, other regression algorithms, e.g. k-NN regression, lasso regression, and XGBoost, have been executed, and the proposed algorithm outperforms them with 2.6% root mean square error.

Complex Network Based SMS Filtering Algorithm

Article

Aug 2009

It is very important to recognize and filter spam short messages (SMS). As the contents and formats of spam messages are diverse, the ordinary filtering methods based on keyword matching and sending speed cannot tackle this problem effectively. This paper first presents a formalized representation of the SMS network. On the basis of real short message samples, the social characteristics of the SMS network are analyzed and studied. Further analysis and statistical work are carried out to discover the abnormal patterns of spam senders in the SMS network. An N-degree association spam filter algorithm (NASFA) based on the abnormal patterns of spam senders is presented. Experiments and analysis show that the algorithm can efficiently recognize spam senders, and the wrong recognition rate is reduced significantly.

A behavior-based SMS antispam system

Article

Jan 2011
IBM J RES DEV

Short messaging service (SMS) is one of the fastest-growing telecom value-added services worldwide. However, mobile message spam is a side effect for ordinary mobile phone users that seriously troubles their daily life and, as a result, threatens the revenue of telecom operators. In this paper, we present an SMS antispam system that combines behavior-based social network and temporal (spectral) analysis to detect spammers with both high precision and recall. The system infrastructure and the proposed approximate neighborhood index solution, which solves the scalability issue of social networks, are described in detail. Experimental results demonstrate that our proposed system achieves excellent discrimination between spammers and legitimate users, and even with fixed recall at 95%, the online system and offline detection subsystems maintain a precision of about 98% and 99.5%, respectively.

Content-based spam filtering

Conference Paper

Aug 2010

Neural Networks and Learning Machine

Chapter

Jan 2008

Simon Haykin

Detection of near-duplicate user generated contents: The SMS spam collection

Conference Paper

Oct 2011

Today, the number of spam text messages has grown, mainly because companies are looking for free advertising. For users, it is very important to filter these kinds of spam messages, which can be viewed as near-duplicate texts because they are mostly created from templates. The identification of spam text messages is a very hard and time-consuming task, and it involves carefully scanning hundreds of text messages. Therefore, since the task of near-duplicate detection can be seen as a specific case of plagiarism detection, we investigated whether plagiarism detection tools could be used as filters for spam text messages. Moreover, we solve the near-duplicate detection problem on the basis of a clustering approach using the CLUTO framework. We carried out some preliminary experiments on the SMS Spam Collection that was recently made available for research purposes. The results were compared with the ones obtained with CLUTO. Although plagiarism detection tools detect a good number of near-duplicate SMS spam messages, even better results are obtained with the CLUTO clustering tool.

An interactive mobile SMS confirmation method using secret sharing technique

Article

Nov 2011
COMPUT SECUR

As we all know, Short Message Service (SMS) has brought about junk emails or nonsense messages coming from advertisement providers, called SMS spam. It bothers subscribers and makes checking SMS messages on their mobile systems distressing. Statistically, each mobile subscriber receives an average of 8.29 short messages every week. Thus, to furnish legitimate message service to mobile subscribers, engineers have strived to design an interactive service system that can certify user participation in a communication session. If a system can verify whether the communicating party is a human being, automated machine attempts can be detected to mitigate the risk. To meet this need, we aim to develop an interactive SMS confirmation mechanism using two well-known techniques, CAPTCHA and secret sharing. Experimental results show that it takes only slight computation cost to complete the authentication, including the identity verification and the check of user participation. This makes the new method well suited to the mobile environment.

Email Spam Filtering: A Systematic Review

Article

Jan 2006

Gordon V. Cormack

Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient make them difficult to formalize. A typical email user has a working definition no more formal than “I know it when I see it.” Yet, current spam filters are remarkably effective, more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be

An evaluation of statistical spam filtering techniques

Article

Dec 2004

This paper evaluates five supervised learning methods in the context of statistical spam filtering. We study the impact of different feature pruning methods and feature set sizes on each learner's performance using cost-sensitive measures. It is observed that the significance of feature selection varies greatly from classifier to classifier. In particular, we found support vector machine, AdaBoost, and maximum entropy model are top performers in this evaluation, sharing similar characteristics: not sensitive to feature selection strategy, easily scalable to very high feature dimension, and good performances across different datasets. In contrast, naive Bayes, a commonly used classifier in spam filtering, is found to be sensitive to feature selection methods on small feature set, and fails to function well in scenarios where false positives are penalized heavily. The experiments also suggest that aggressive feature pruning should be avoided when building filters to be used in applications where legitimate mails are assigned a cost much higher than spams (such as λ = 999), so as to maintain a better-than-baseline performance. An interesting finding is the effect of mail headers on spam filtering, which is often ignored in previous studies. Experiments show that classifiers using features from message header alone can achieve comparable or better performance than filters utilizing body features only. This implies that message headers can be reliable and powerfully discriminative feature sources for spam filtering.

An assessment of case-based reasoning for short text message classification

Jan 2005
257-266

M Healy
S Delany
A Zamolotskikh

Healy M, Delany S, Zamolotskikh A (2005) An assessment of case-based reasoning for short text message classification. In: Proceedings of 16th Irish Conference on Artificial Intelligence and Cognitive Science, pp 257-266

A new spam short message classification

Jan 2009
168-171

L Z Duan
A Li
L J Huang

Duan LZ, Li A, Huang LJ (2009) A new spam short message classification. In: Proceeding of the 1st International Workshop on Education Technology and Computer Science, pp 168-171

Chinese short messages service spam filtering based on logistic regression

Jan 2010
36-39

X Zheng
C Liu
Y Zou

Zheng X, Liu C, Zou Y (2010) Chinese short messages service spam filtering based on logistic regression. J Heilongjiang Inst Technol 4(24):36-39

A convolution neural network based approach to detect the disease in corn crop

2019
176-181

M Agarwal
V K Bohat
M D Ansari
A Sinha
S K Gupta
D Garg

Agarwal M, Bohat VK, Ansari MD, Sinha A, Gupta SK, Garg D (2019) A convolution neural network based approach to detect the disease in corn crop. In: 2019 IEEE 9th International Conference on Advanced Computing (IACC), pp 176-181. IEEE

Arvind Kumar Vishwakarma is currently pursuing a PhD in computer science at the National Institute of Technology, Srinagar, Uttarakhand. He completed his M.Tech in Computer Science and Engineering at Graphic Era University, Dehradun, in 2011 and obtained an MCA degree from Uttar Pradesh Technical University, Lucknow, UP, in 2006. He has published 3 papers in international journals and conferences. His research interest is machine learning.
