
Email Spam Detection Using Machine Learning Algorithms

July 2020


DOI:10.1109/ICIRCA48905.2020.9183098

Conference: 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA)

Authors:

Sanket Sonowal

Indian Institute of Technology Guwahati



Email Spam Detection Using Machine Learning Algorithms

Nikhil Kumar

Computer Science and Engineering Department

Delhi Technological University

New Delhi, India

nikhilkmr445@gmail.com

Sanket Sonowal

Computer Science and Engineering Department

Delhi Technological University

New Delhi, India

sanketsonowal@gmail.com

Nishant

Computer Science and Engineering Department

Delhi Technological University

New Delhi, India

nishantyadav420.ny@gmail.com

Abstract: Email spam has become a major problem. With the rapid growth in the number of internet users, the volume of spam email is also increasing. Spam is used for illegal and unethical purposes such as phishing and fraud, and malicious links sent through spam emails can harm a recipient's system or give attackers access to it. Creating a fake profile and email account is easy for spammers, who pose as genuine people in their emails and target recipients who are unaware of these frauds. It is therefore necessary to identify fraudulent spam emails. This project identifies spam using machine learning techniques: this paper discusses several machine learning algorithms, applies them to our data sets, and selects the algorithm with the best precision and accuracy for email spam detection.

Keywords: Machine learning, Naïve Bayes, support vector machine, K-nearest neighbor, random forest, bagging, boosting, neural networks.

I. INTRODUCTION

Email (electronic mail) spam refers to the “use of email to send unsolicited emails or advertising emails to a group of recipients. Unsolicited emails mean the recipient has not granted permission for receiving those emails.” The use of spam email has been increasing over the last decade, and spam has become a serious nuisance on the internet: it wastes storage, time and network bandwidth. Automatic email filtering may be the most effective method of detecting spam, but spammers can now bypass many of these filtering applications. Several years ago, most spam could be blocked manually because it came from particular email addresses. In this work, a machine learning approach is used for spam detection.

The major approaches adopted for spam filtering include “text analysis, white and blacklists of domain names, and community-based techniques”. Text analysis of the content of emails is a widely used method of combating spam, and many solutions deployable on the server and client sides are available. Naïve Bayes is one of the best-known algorithms applied in these methods. However, rejecting emails purely on the basis of content analysis is problematic when false positives occur, because users and organizations normally do not want any legitimate message to be lost. The blacklist approach was probably the earliest technique used for filtering spam: it accepts all emails except those from domains or email addresses that are explicitly blacklisted. As newer domains keep entering the category of spamming domain names, this technique tends to work less well. The whitelist approach accepts emails from domain names or addresses that are explicitly whitelisted and places all others in a lower-priority queue, from which a message is delivered only after the sender responds to a confirmation request sent by the “junk mail filtering system”.

Spam and Ham: According to Wikipedia, “the use of electronic messaging systems to send unsolicited bulk messages, especially mass advertisement, malicious links etc.” is called spam. “Unsolicited” means messages that you did not ask to receive from the source, so if you do not know the sender, the email may be spam. People often do not realize that they signed up for such mailings when they downloaded a free service or software, or updated software. The term “ham” was coined by SpamBayes around 2001 and is defined as “emails that are generally desired and are not considered spam”.

Fig.1. Classification into Spam and non-spam

Proceedings of the Second International Conference on Inventive Research in Computing Applications (ICIRCA-2020)

IEEE Xplore Part Number: CFP20N67-ART; ISBN: 978-1-7281-5374-2

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY GUWAHATI. Downloaded on September 15,2022 at 04:25:24 UTC from IEEE Xplore. Restrictions apply.

Machine learning approaches are more efficient. They use a set of training data consisting of emails that have been classified in advance. Many machine learning algorithms can be used for email filtering, including “Naïve Bayes, support vector machines, Neural Networks, K-nearest neighbor, Random Forests etc.”

II. LITERATURE REVIEW

Several related works apply machine learning methods to email spam detection. A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti and M. Alazab [2] present a focused literature survey of artificial intelligence (AI) and machine learning methods for email spam detection. K. Agarwal and T. Kumar [3] integrate Naïve Bayes with particle swarm optimization. Harisinghaney et al. (2014) [4] and Mohamad & Selamat (2015) [5] have used the “image and textual dataset for the e-mail spam detection with the use of various methods. Harisinghaney et al. (2014) [4] have used methods of KNN algorithm, Naïve Bayes, and Reverse DBSCAN algorithm with experimentation on dataset. For the text recognition, OCR library” [3] is employed, but this OCR does not perform well. Mohamad & Selamat (2015) [5] use a hybrid feature selection approach that combines TF-IDF (Term Frequency Inverse Document Frequency) and rough set theory.

A. Data Set

This model uses email data sets from different online sources such as Kaggle and sklearn, and some data sets were created by the authors. A spam email data set from Kaggle is used to train our model, and the other email data sets are used to obtain results. The “spam.csv” data set contains 5573 lines and 2 columns, and the other data sets contain 574, 1001 and 956 lines of email data in text format.

III. METHODOLOGY

A. Data preprocessing:

Data is often thought of as very large data sets with a large number of rows and columns, but that is not always the case: data can come in many forms, such as structured tables, images, audio and video files. A machine does not understand images, video or text as they are; it only understands 1s and 0s, so the data must first be converted into a numerical form.

Steps in Data Preprocessing:

Data cleaning: This step performs “filling of missing values”, “smoothing of noisy data”, “identifying or removing outliers”, and “resolving of inconsistencies”.

Data integration: This step combines several databases, files or data sets.

Data transformation: Aggregation and normalization are performed to scale the data to a specific range.

Data reduction: This step produces a reduced representation of the dataset that is much smaller in size but still yields the same analytical result.

1. Stop words:

“Stop words are the English words that do not add much meaning to a sentence.” They can be safely ignored without sacrificing the sense of the sentence.

For example, given the query “How to make a veg cheese sandwich”, a search engine would try to find web pages that contain the terms “how”, “to”, “make”, “a”, “veg”, “cheese” and “sandwich”. Because the terms “how”, “to” and “a” are so common in English, it would find far more pages containing those terms than pages containing veg cheese sandwich recipes. If these three words are removed, or stopped, the search focuses on retrieving pages that contain the keywords “veg”, “cheese” and “sandwich”, which gives the results of interest.
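The sandwich query above can be run through a minimal stop-word filter. This is only a sketch: the `STOP_WORDS` set and the function name `remove_stop_words` are illustrative assumptions, not the list or code used in the paper.

```python
import string

# A tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"how", "to", "a", "the", "is", "in", "of", "and"}

def remove_stop_words(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]

print(remove_stop_words("How to make a veg cheese sandwich"))
# → ['make', 'veg', 'cheese', 'sandwich']
```

Only the content-bearing words remain, which is what the later feature extraction step works on.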

2. Tokenization:

“Tokenization is the process of splitting a stream of text into phrases, symbols, words, or other meaningful elements called tokens.” The list of tokens then serves as input for further processing such as text mining and parsing. Tokenization is useful both in linguistics, where it is a form of text segmentation, and in computer science, where it is part of lexical analysis.

It is sometimes hard to define what is meant by the term “word”, since tokenization happens at the word level. A tokenizer often relies on simple heuristics, for instance:

Tokens are separated by whitespace characters, such as “line break” or “space”, or by “punctuation characters”.

Every contiguous string of alphabetic characters is part of one token; likewise with numbers.

Whitespace and punctuation may or may not be included in the resulting list of tokens.
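The three heuristics above can be expressed with one regular expression. The following is a minimal sketch; the function name `tokenize` and the `keep_punctuation` flag are assumptions made for illustration and are limited to ASCII letters.

```python
import re

def tokenize(text, keep_punctuation=False):
    """Split text into runs of letters and runs of digits, optionally keeping punctuation."""
    if keep_punctuation:
        # Third alternative: any single character that is not whitespace, a letter or a digit.
        pattern = r"[A-Za-z]+|\d+|[^\sA-Za-z\d]"
    else:
        pattern = r"[A-Za-z]+|\d+"
    return re.findall(pattern, text)

print(tokenize("Win 1000 dollars now!"))
# → ['Win', '1000', 'dollars', 'now']
print(tokenize("Win 1000 dollars now!", keep_punctuation=True))
# → ['Win', '1000', 'dollars', 'now', '!']
```

Whitespace only separates tokens and is never returned, while punctuation is included or dropped depending on the flag.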

3. Bag of words

“Bag of Words (BOW) is a method of extracting features from text documents. Further these features can be used for training machine learning algorithms. Bag of Words creates a vocabulary of all the unique words present in all the documents in the Training dataset.”
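As a rough illustration of the idea, the sketch below builds a vocabulary from a few made-up training documents and turns each document into a vector of word counts. The helper names and the sample documents are assumptions; the paper itself relies on sklearn, whose `CountVectorizer` class provides this functionality.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every unique word in the training documents to a column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

# Hypothetical training documents.
train_docs = ["free prize now", "meeting at noon", "free free offer"]
vocab = build_vocabulary(train_docs)
vectors = [to_count_vector(doc, vocab) for doc in train_docs]

print(list(vocab))   # column order of the feature vectors
print(vectors[2])    # counts for "free free offer"
```

Word order is discarded; only the counts per vocabulary word are kept, which is why the representation is called a "bag".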

B. CLASSIC CLASSIFIERS

Classification is a form of data analysis that extracts models describing important data classes. A classifier, or model, is constructed to predict class labels, for example “A loan application as risky or safe.”

Data classification is a two-step process:

- a learning step (construction of the classification model) and

- a classification step

1. NAÏVE BAYES:

The Naïve Bayes classifier was used for spam recognition in 1998. It is a supervised learning algorithm. A Bayesian classifier works with dependent events: it estimates the probability of an event occurring in the future from the same event having occurred previously. Naïve Bayes is based on Bayes' theorem and assumes that the features are independent of each other. The Naïve Bayes technique can be used for classifying spam emails because word probabilities play the main role: if a word occurs often in spam but rarely in ham, an email containing it is more likely to be spam. Naïve Bayes has become one of the most widely used techniques for email filtering. For the filter to work effectively, the model must be trained well. Naïve Bayes calculates the probability of each class, and the class with the maximum probability is chosen as the output. It often gives accurate results and is used in many fields, such as spam filtering.

(1)

(2)
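The bodies of equations (1) and (2) are not present in this copy. As an assumption about what they express, Bayes' theorem is commonly written $P(c \mid x) = P(x \mid c)\,P(c) / P(x)$, and a Naïve Bayes classifier outputs the class $c$ that maximizes $P(c) \prod_i P(x_i \mid c)$. The sketch below applies that rule to tokenized emails with word counts and add-one smoothing; the toy documents, the function names, and the choice of smoothing are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Collect class frequencies and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(tokens, model):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior P(c)
        total_words = sum(word_counts[label].values())
        for word in tokens:
            # Add-one (Laplace) smoothing avoids zero probability for unseen words.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus.
docs = [["free", "prize", "now"], ["win", "free", "cash"],
        ["meeting", "at", "noon"], ["lunch", "at", "noon"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)
print(predict(["free", "cash", "prize"], model))  # → spam
print(predict(["meeting", "noon"], model))        # → ham
```

Working with log probabilities avoids numerical underflow when many small word probabilities are multiplied.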

2. SUPPORT VECTOR MACHINE

“The Support Vector Machine (SVM) is a popular Supervised Learning algorithm, the Support Vector model is used for classification problems in Machine Learning techniques.” Support vector machines are founded on the idea of decision boundaries. The main purpose of the SVM algorithm is to create a line or decision boundary that separates the classes. The algorithm outputs a hyperplane that classifies new samples. In 2-dimensional space, the “hyperplane is line dividing a plane into 2 parts where each class is present in one side.”

Fig.2 Support Vector Machine
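Once a hyperplane has been learned, classifying a new sample only requires checking which side of it the sample falls on. The sketch below shows that decision rule for a 2-dimensional case; the weights and bias are hypothetical values chosen for illustration, not parameters learned by an SVM in the paper, and the training step itself is not shown.

```python
def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign tells on which side of the hyperplane the point lies."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def classify(weights, bias, point):
    """Assign 'spam' to the non-negative side of the hyperplane and 'ham' to the other."""
    return "spam" if decision_value(weights, bias, point) >= 0 else "ham"

# Hypothetical hyperplane x1 + x2 - 3 = 0, which is a line in 2-dimensional space.
weights, bias = [1.0, 1.0], -3.0
print(classify(weights, bias, [2.5, 2.0]))  # score 1.5  → spam
print(classify(weights, bias, [0.5, 1.0]))  # score -1.5 → ham
```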

3. DECISION TREE

“Decision tree induction is the learning of decision tree from class labeled training tuples”. A decision tree is a flowchart-like structure, where:

Internal node (non-leaf node) = a test on an attribute

Branch = an outcome of the test

Leaf node = holds a class label

The topmost node is called the root node.

Fig.3. Decision Tree Structure

Decision tree Induction:

Building a decision tree classifier does not require “any domain knowledge or parameter setting”, which makes it suitable for exploratory knowledge discovery. It can handle multidimensional data, and the learning and classification phases of decision tree induction are simple and fast. Attribute selection measures are used to choose the attribute that best partitions the tuples into distinct classes. When a decision tree is built, many of the branches may reflect noise and outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.

Entropy using the frequency table of one attribute:

(3)

Entropy using the frequency table of two attributes:

(4)
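The bodies of equations (3) and (4) are not present in this copy. Assuming they are the usual definitions, the entropy of a label set $S$ is $E(S) = \sum_i -p_i \log_2 p_i$, and the entropy after splitting on an attribute $X$ is $E(T, X) = \sum_{c \in X} P(c)\,E(c)$. The sketch below computes both from frequency tables; the attribute `has_link` and the sample labels are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of a list of class labels, computed from its frequency table."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def conditional_entropy(attribute_values, labels):
    """Weighted entropy of the labels after splitting on one attribute."""
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    total = len(labels)
    return sum(len(group) / total * entropy(group) for group in groups.values())

# Hypothetical data: whether an email contains a link, and its class.
labels = ["spam", "spam", "ham", "ham"]
has_link = ["yes", "yes", "no", "no"]

before = entropy(labels)                       # two equally likely classes
after = conditional_entropy(has_link, labels)  # the split separates them perfectly
print(before, after, before - after)
```

The difference `before - after` is the information gain, a common attribute selection measure: the attribute with the largest gain is chosen for the split.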

4. K-NEAREST NEIGHBOUR

“K-nearest neighbors is a supervised classification algorithm. This algorithm has some data points and data vectors that are separated into several classes to predict the classification of a new sample point.”

K-nearest neighbor is a lazy algorithm: it only memorizes the training data and builds no explicit model, deferring all computation until a new point has to be classified.

The algorithm classifies a new point based on a similarity measure, which can be the Euclidean distance. The distance between the new point and each training point is measured to identify its nearest neighbors:

$\mathrm{dist}((x, y), (a, b)) = \sqrt{(x - a)^2 + (y - b)^2}$ (5)
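A minimal sketch of the procedure, using the Euclidean distance of equation (5) and a majority vote among the k closest training points. The 2-dimensional sample points, the labels, and the default `k=3` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated clusters.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
print(knn_predict(points, labels, (2, 2)))  # → ham
print(knn_predict(points, labels, (6, 5)))  # → spam
```

No training step appears in the code: all the work happens inside `knn_predict`, which is what makes the method lazy.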


C. ENSEMBLE LEARNING METHODS

“Ensemble methods in machine learning is a method that takes several base model to produce a predictive model in order to decrease” variance (by using bagging), decrease bias (by using boosting), or improve predictions (by using stacking). There are two types:

Sequential: the base classifiers are created one after another.

Parallel: the base classifiers are created in parallel.

1. RANDOM FOREST CLASSIFIER

A random forest classifier is an ensemble tree classifier consisting of many decision trees of different shapes and sizes. Randomness enters in two ways:

Random sampling of the training data when building each tree.

A random subset of the input features when splitting at each node of a tree.

This randomization makes the decision trees less correlated with one another (the trees should not all look the same), so that the generalization error of the ensemble can be improved.
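The two sources of randomness can be sketched without building full trees. The helper below is hypothetical and only draws what one tree would receive: a bootstrap sample of row indices and a random subset of candidate features. Using about the square root of the number of features is a common default and is assumed here, not taken from the paper.

```python
import math
import random

def draw_tree_inputs(n_rows, n_features, rng):
    """Pick the bootstrap rows and the candidate features for one split of one tree."""
    # Randomness 1: sample row indices with replacement (a bootstrap sample).
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Randomness 2: consider only a random subset of the features at a split.
    subset_size = max(1, int(math.sqrt(n_features)))
    features = rng.sample(range(n_features), subset_size)
    return rows, features

rng = random.Random(0)  # fixed seed so the run is repeatable
for tree in range(3):
    rows, features = draw_tree_inputs(n_rows=8, n_features=16, rng=rng)
    print(tree, rows, features)
```

Because each tree sees different rows and different candidate features, the trees make different mistakes, which is what the ensemble vote exploits.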

2. BAGGING

“Bagging classifier is an ensemble classifier that fits base classifiers each on random subsets of the original data set and then combines their individual predictions (by voting or by averaging) to form a final prediction.” Bagging is a combination of bootstrapping and aggregating:

Bagging = Bootstrap AGGregatING

Bootstrapping helps to lessen the variance of the classifier and reduces overfitting by resampling data from the training set with the same cardinality as the original data set. High variance is not good for the model. Bagging is a very effective method when data is limited: by using resampled subsets, an estimate can be obtained by aggregating the scores.
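A minimal sketch of the two halves of bagging: bootstrap resampling with the same cardinality as the original data, and aggregation by majority vote. The base learner here is a deliberately simple threshold rule, and the one-dimensional data is made up; both are assumptions for illustration only.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Resample the data with replacement, keeping the original size."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate the base classifiers' predictions by voting."""
    return Counter(predictions).most_common(1)[0][0]

def threshold_classifier(sample):
    """Toy base learner: predict 'spam' above the mean feature value of its sample."""
    threshold = sum(value for value, _ in sample) / len(sample)
    return lambda value: "spam" if value > threshold else "ham"

# Hypothetical 1-D feature (for example, a count of suspicious words) with labels.
data = [(0, "ham"), (1, "ham"), (2, "ham"), (8, "spam"), (9, "spam"), (10, "spam")]
rng = random.Random(42)
models = [threshold_classifier(bootstrap_sample(data, rng)) for _ in range(11)]

print(majority_vote([model(9) for model in models]))
print(majority_vote([model(1) for model in models]))
```

Each base model is trained on a slightly different resample, so individual thresholds vary, while the vote over all eleven models is more stable than any single one.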

3. BOOSTING AND ADABOOST CLASSIFIER

“Boosting is an ensemble method that is used to create a strong classifier using a number of weak classifiers. Boosting is done by creating a model from the training data set, then creating another model that corrects the faults of the first model.” [8] In boosting, models are added until the training set is predicted correctly or a maximum number of models is reached.

AdaBoost = Adaptive Boosting

AdaBoost was the first successful boosting algorithm developed for binary classification, and it is the usual starting point for understanding boosting.
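How a later model comes to "correct the faults" of an earlier one can be seen in a single AdaBoost round: samples the weak classifier got wrong receive more weight, so the next classifier concentrates on them. The sketch below uses the standard textbook update for binary AdaBoost; the function name and the four-sample example are assumptions, and it presumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_round(weights, is_correct):
    """One AdaBoost round: score the weak classifier, then reweight the samples."""
    # Weighted error: total weight of the misclassified samples (assumed 0 < error < 1).
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # vote weight of this weak classifier
    # Misclassified samples gain weight, correctly classified ones lose weight.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
alpha, weights = adaboost_round(weights, [True, True, True, False])
print(round(alpha, 4), [round(w, 4) for w in weights])
# → 0.5493 [0.1667, 0.1667, 0.1667, 0.5]
```

After one round the single misclassified sample carries half of the total weight, so the next weak classifier is strongly pushed to get it right.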

IV. ALGORITHMS

1.1. Insert the dataset or file for training or testing.
1.2. Check the dataset for supported encoding.
1.2.1. If it is in one of the supported encodings, go to step 1.4.
1.2.2. If it is not in one of the supported encodings, go to step 1.3.
1.3. Change the encoding of the inserted file into one of the supported encodings, then try reading it again.
1.4. Select whether to “Train”, “Test” or “Compare” the models using the dataset.
1.4.1. If “Train” is selected, go to step 1.5.
1.4.2. If “Test” is selected, go to step 1.6.
1.4.3. If “Compare” is selected, go to step 1.7.
1.5. “Train” selected:
1.5.1. Select which classifier to train using the inserted dataset.
1.5.2. Check for duplicates and NaN values.
1.5.3. Find the values from hyperparameter tuning.
1.5.4. Process the text for feature transform.
1.5.5. Train the model.
1.5.6. Save the model and features. Show the results.
1.6. “Test” selected:
1.6.1. Select which classifier to test using the inserted dataset.
1.6.2. Check for duplicates and NaN values.
1.6.3. Load the model and features saved in the training phase of the model.
1.6.4. Use the loaded values to test the dataset.
1.6.5. Show the results.
1.7. “Compare” selected:
1.7.1. Compare all the classifiers using the inserted dataset.
1.7.2. Show the results of the classifiers.
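The encoding check and the mode selection at the start of the procedure (steps 1.1 to 1.4) could look roughly like the sketch below. The list of supported encodings, the function names `read_dataset` and `run`, and the placeholder return strings are all assumptions; the paper does not name the encodings or show its code.

```python
import os
import tempfile

# Assumed list of supported encodings; the paper does not name them.
# latin-1 accepts any byte sequence, so it acts as the last resort here.
SUPPORTED_ENCODINGS = ("utf-8", "latin-1")

def read_dataset(path):
    """Return (text, encoding) using the first supported encoding that decodes the file."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in SUPPORTED_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Reached only if the list above contains no catch-all encoding.
    raise ValueError("file is not in a supported encoding; convert it and retry")

def run(mode, text):
    """Dispatch to the selected branch (placeholders stand in for the real work)."""
    actions = {"Train": "training", "Test": "testing", "Compare": "comparing"}
    if mode not in actions:
        raise ValueError("mode must be Train, Test or Compare")
    return f"{actions[mode]} on {len(text.splitlines())} rows"

# Demonstration with a temporary file that is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("label,text\nham,caf\xe9\n".encode("latin-1"))
text, encoding = read_dataset(tmp.name)
os.remove(tmp.name)
print(encoding, "|", run("Train", text))
# → latin-1 | training on 2 rows
```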

A. Implementation

Visual Studio Code is used to implement the model. In this module, a dataset from the “Kaggle” website is used as the training dataset. The inserted dataset is first checked for duplicates and null values for better performance of the machine. The dataset is then split into two sub-datasets, a “train dataset” and a “test dataset”, in the proportion 70:30. The “train” and “test” datasets are passed as parameters for text processing, in which punctuation symbols and words in the stop-words list are removed and the remaining clean words are returned. These clean words are then passed to “Feature Transform”, where they are used for ‘fit’ and ‘transform’ to create a vocabulary for the machine. The dataset is also passed to “hyperparameter tuning” to find optimal values for the classifier to use with that dataset.

After the values are obtained from “hyperparameter tuning”, the machine is fitted using those values with a random state. The state of the trained model and its features are saved for later use in testing unseen data.

The machines are trained with the values obtained above using classifiers from the sklearn module in Python.
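The first two implementation steps, removing duplicates and null values and then splitting 70:30, can be sketched with the standard library alone. The paper uses sklearn for its pipeline; the helper names, the sample rows, and the fixed seed below are assumptions made so the example is self-contained and repeatable.

```python
import random

def clean_rows(rows):
    """Drop rows with missing fields and exact duplicates, keeping first occurrences."""
    seen, cleaned = set(), []
    for label, text in rows:
        if label is None or text is None or (label, text) in seen:
            continue
        seen.add((label, text))
        cleaned.append((label, text))
    return cleaned

def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed (the 'random state'), then split into train and test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rows: nine distinct ham messages, one duplicated spam, one null text.
rows = [("ham", f"message {i}") for i in range(9)]
rows += [("spam", "win cash"), ("spam", "win cash"), ("ham", None)]

cleaned = clean_rows(rows)
train, test = split_rows(cleaned)
print(len(cleaned), len(train), len(test))
# → 10 7 3
```

Fixing the seed plays the role of the random state mentioned above: repeated runs produce the same split, so results from different classifiers remain comparable.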


B. Flowchart of the Model

Fig.4. Flow Chart of Model

V. RESULT

Our model has been trained using multiple classifiers to check and compare their results for greater accuracy. Each classifier reports its evaluated results to the user. Once all the classifiers have returned their results, the user can compare them to see whether the data is “spam” or “ham”. Each classifier's results are shown in graphs and tables for better understanding. The dataset used for training is “spam.csv”, obtained from the “Kaggle” website. To test the trained machine, a different CSV file named “emails.csv” was created from unseen data, i.e. data not used for training the machine.


TABLE I. COMPARISON TABLE

Classifiers | Score 1 | Score 2 | Score 3 | Score 4
Support Vector Classifier | 0.81 | 0.92 | 0.95 | 0.92
K-Nearest Neighbour | 0.92 | 0.88 | 0.87 | 0.88
Naïve Bayes | 0.87 | 0.98 | - | -
Decision Tree | 0.94 | 0.95 | 0.93 | 0.95
Random Forest | 0.90 | 0.92 | - | -
AdaBoost Classifier | 0.95 | 0.94 | 0.95 | 0.94
Bagging Classifier | 0.94 | 0.95 | 0.94 | -

a. Score 1: using default parameters
b. Score 2: using hyperparameter tuning
c. Score 3: using stemmer and hyperparameter tuning
d. Score 4: using length, stemmer and hyperparameter tuning

Values are listed in the order in which they appear in the source; cells marked - are not present in it.

Fig.5 Comparison of all algorithms

Fig.6. Comparison Graph

VI. CONCLUSION

From these results, it can be concluded that Multinomial Naïve Bayes gives the best outcome, but it is limited by the assumption of class-conditional independence, which causes the machine to misclassify some tuples. Ensemble methods, on the other hand, proved useful because they use multiple classifiers for class prediction. Large numbers of emails are sent and received nowadays, which makes the task difficult, since our project can only be tested on a limited corpus. Our spam detector filters emails according to their content, not according to domain names or any other criteria, so at present it works only on the body of the email.

There is wide scope for improvement in our project. The following improvements can be made:

“Filtering of spams can be done on the basis of the trusted and verified domain names.”

“The spam email classification is very significant in categorizing e-mails and to distinguish e-mails that are spam or non-spam.”

“This method can be used by large organizations to differentiate decent mails that are only the emails they wish to obtain.”

REFERENCES

1. Suryawanshi, Shubhangi & Goswami, Anurag & Patil, Pramod. (2019). Email Spam Detection: An Empirical Comparative Study of Different ML and Ensemble Classifiers. 69-74. 10.1109/IACC48062.2019.8971582.

2. Karim, A., Azam, S., Shanmugam, B., Krishnan, K., & Alazab, M. (2019). A Comprehensive Survey for Intelligent Spam Email Detection. IEEE Access, 7, 168261-168295. [08907831]. https://doi.org/10.1109/ACCESS.2019.2954791

3. K. Agarwal and T. Kumar, "Email Spam Detection Using Integrated Approach of Naïve Bayes and Particle Swarm Optimization," 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 2018, pp. 685-690.

4. Harisinghaney, Anirudh, Aman Dixit, Saurabh Gupta, and Anuja Arora. "Text and image-based spam email classification using KNN, Naïve Bayes and Reverse DBSCAN algorithm." In Optimization, Reliability, and Information Technology (ICROIT), 2014 International Conference on, pp. 153-155. IEEE, 2014.

5. Mohamad, Masurah, and Ali Selamat. "An evaluation on the efficiency of hybrid feature selection in spam email classification." In Computer, Communications, and Control Technology (I4CT), 2015 International Conference on, pp. 227-231. IEEE, 2015.

6. Shradhanjali, Prof. Toran Verma, “E-Mail Spam Detection and Classification Using SVM and Feature Extraction”, in International Journal of Advance Research, Ideas and Innovations in Technology, 2017. ISSN: 2454-132X. Impact factor: 4.295.

7. W.A, Awad & S.M, ELseuofi. (2011). Machine Learning Methods for Spam E-Mail Classification. International Journal of Computer Science & Information Technology. 3. 10.5121/ijcsit.2011.3112.

8. A. K. Ameen and B. Kaya, "Spam detection in online social networks by deep learning," 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), Malatya, Turkey, 2018, pp. 1-4.

9. Diren, D.D., Boran, S., Selvi, I.H., & Hatipoglu, T. (2019). Root Cause Detection with an Ensemble Machine Learning Approach in the Multivariate Manufacturing Process.

10. Tasnim Kabir, Abida Sanjana Shemonti, Atif Hasan Rahman. "Notice of Violation of IEEE Publication Principles: Species Identification Using Partial DNA Sequence: A Machine Learning Approach", 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE), 2018.


Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection

Preprint

Full-text available

May 2024

Phishing emails continue to pose a significant threat, causing financial losses and security breaches. This study addresses limitations in existing research, such as reliance on proprietary datasets and lack of real-world application, by proposing a high-performance machine learning model for email classification. Utilizing a comprehensive and largest available public dataset, the model achieves a f1 score of 0.99 and is designed for deployment within relevant applications. Additionally, Explainable AI (XAI) is integrated to enhance user trust. This research offers a practical and highly accurate solution, contributing to the fight against phishing by empowering users with a real-time web-based application for phishing email detection.

ADVANCEMENTS IN EMAIL FORENSICS A COMPARATIVE ANALYSIS OF MACHINE LEARNING MODELS FOR SPAM DETECTION

Chapter

Full-text available

May 2024

Email continues to be a crucial means of communication in the digital age, playing an essential role in personal, intellectual, and professional interactions. Nevertheless, the widespread use of email has also made it a primary objective for unsolicited messages, varying from harmless adverts to harmful phishing schemes, posing considerable difficulties in terms of safeguarding, confidentiality, and efficiency. This study aims to address the urgent requirement for accurate spam detection by conducting a thorough comparative examination of machine learning (ML) models. The objective is to determine the best effective strategies for separating spam emails from legitimate ones. Using a dataset of labeled emails, we explore the process of preparing and analyzing text data to discover important features that are crucial for categorization. The scope of our analysis includes a broad range of machine learning classifiers, such as Support Vector Machines, Naive Bayes, Decision Trees, as well as advanced ensemble approaches like Random Forest and Gradient Boosting. In addition, we investigate the influence of feature engineering, specifically the effects of text length and word count, on improving the performance of the models. The empirical findings demonstrate that ensemble approaches, particularly the Extra Trees Classifier, attain exceptional accuracy, underscoring its superiority in terms of generalization and spam detection capacities. This research not only enhances the scholarly comprehension of spam detection mechanisms but also provides valuable insights for the practical implementation of more efficient and robust spam filtering systems, representing a crucial advancement in securing digital communication platforms against the widespread menace of spam.

Bayesian Classification for SMS Spam Detection in Mobile Devices

Article

Apr 2024

The abundance of unwanted spam messages complicates the use of Short Message Service (SMS) for efficient communication in modern times. This study investigates developing and utilizing a Naive Bayes Theorem-based Ham/Spam detection system. Because of its ease of use and effectiveness in text classification tasks, the Naive Bayes classifier is used. A collection of SMS messages labeled as” spam” or” ham” (non-spam) makes up the dataset that was used for testing and training. Preprocessing methods, including tokenization, stop-word elimination, and stemming, are employed to extract pertinent features from the text messages. The Naive Bayes classifier learns how words relate to whether they’re in a spam or non-spam message by looking at some examples from the dataset. Utilizing criteria such as accuracy, precision, and confusion matrix on a separate testing set, the classifier’s performance is evaluated. Additionally, the impact of varying parameters such as smoothing techniques and feature selection methods on the classifier’s performance is analyzed. The experimental results used to distinguishing between ham and spam messages in SMS communication

Combatting Spam in Online Chat Platform: A Comprehensive Approach to Detection and Mitigation

Article

Full-text available

Apr 2024

A Generalized Two-Level Ensemble Method for Spam Mail Detection

Article

Full-text available

Apr 2024

C. Nagaraju B. Aruna Kumari

Email is the most cost-effective way to communicate with people across the world. It offers a simple and convenient way to send and receive messages. However, it is susceptible to various types of threats. The most significant risk to emailers is spam, which refers to unsolicited emails that are often sent in large numbers to multiple recipients. Malicious spam includes links to phishing websites. These links not only pose a threat to our system, but also make our personal information accessible to hackers. Spammers often create fake profiles and email accounts making it easier for them to deceive unsuspecting victims. In this paper, the content of the email is used to detect spam which provides more information than a URL. It has been found that several techniques have been proposed to efficiently identify spam emails but current email spam detection methods have not yet achieved high accuracy. It is necessary to improve their performance in spam detection, a generalized two-level ensemble technique is explored irrespective of the type of spam datasets. The proposed two-level ensemble algorithm compares with various machine learning techniques such as SVM, Regression, and kNN, along with their variants, Finally, a K-fold and voting algorithms are applied to make the final prediction.

Designing An Intelligent Architecture to Detect Malevolent E-Mail Using Machine Learning Approach

Conference Paper

Mar 2024

Improved Spam Detection Through LSTM- Based Approach

Conference Paper

Mar 2024

Realtime spam detection system using random forest and support vector machine with countvectorizer algorithm

Conference Paper

Jan 2024

Email Spam Classifier

Article

Apr 2024

Communication plays a major part in everything, be it professional or personal. Because of its widespread use, accessibility, affordability, and free services, email is a popular communication tool. The rise in email-based attacks is a direct result of email protocol weaknesses as well as the growing volume of electronic commerce and financial activities. One of the main issues with today's Internet is email spam, which can financially harm businesses and bother individual consumers. Spammers find it simple to send spam emails, and our inboxes are flooded with pointless messages. We receive an overwhelming volume of spam emails every day, making it difficult and time-consuming to separate them from legitimate mail. Spam remains a problem despite all the efforts made to eradicate it. Furthermore, even valid emails are filtered out when countermeasures become excessively sensitive. Filtering is one of the key strategies among the methods created to prevent spam. This research aims to explore machine learning algorithms and their application to our data sets. The algorithm with the best precision and accuracy is chosen for email spam detection.

Phishing Detection Using Machine Learning Algorithm

Article

Mar 2024

Phishing is a criminal scheme to steal a user’s personal data and other credential information. It is a fraud that acquires a victim’s confidential information, such as passwords, bank account details, credit card numbers, and financial usernames, which can later be misused by the attacker. The use of machine learning algorithms in phishing detection has gained significant attention in recent years. This research paper aims to evaluate the effectiveness of various machine learning algorithms in detecting phishing URLs and websites. The algorithms tested in this study are Decision Tree, Random Forest, Multilayer Perceptron, XGBoost, Autoencoder Neural Network, and Support Vector Machines. A dataset of phishing URLs is used to train and test the algorithms, and their performance is evaluated based on metrics such as accuracy, precision, recall, and F1 score. The paper takes phishing URLs from PhishTank and legitimate URLs from the University of New Brunswick. The results of this study demonstrate that the Random Forest and XGBoost algorithms outperform the other algorithms in terms of accuracy and other performance metrics, and that the system has an overall accuracy of 98%.
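The four evaluation metrics named in this abstract follow directly from the confusion-matrix counts. The sketch below shows the standard definitions; the function name `classification_metrics`, the label strings, and the toy predictions are assumptions for illustration and are not taken from the study.

```python
def classification_metrics(y_true, y_pred, positive="phishing"):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 3 phishing URLs and 5 legitimate URLs.
y_true = ["phishing"] * 3 + ["legit"] * 5
y_pred = ["phishing", "phishing", "legit", "phishing",
          "legit", "legit", "legit", "legit"]
print(classification_metrics(y_true, y_pred))
```

In this toy case there are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, so accuracy is 0.75 while precision, recall, and F1 are each 2/3.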

A Comprehensive Survey for Intelligent Spam Email Detection

Article

Nov 2019

The rapidly growing problem of phishing e-mail, also known as spam and including spear phishing and spam-borne malware, has created a need for reliable intelligent anti-spam e-mail filters. This survey paper presents a focused literature survey of Artificial Intelligence (AI) and Machine Learning (ML) methods for intelligent spam email detection, which we believe can help in developing appropriate countermeasures. In this paper, we considered four parts of the email’s structure that can be used for intelligent analysis: (A) the headers, which provide routing information and contain mail transfer agent (MTA) records such as the email and IP address of each sender and recipient, where the email originated, what stopovers it made, and its final destination; (B) the SMTP envelope, containing the mail exchangers’ identification and the originating source and destination domains/users; (C) the first part of the SMTP data, containing information such as from, to, date, and subject, which appear in most email clients; and (D) the second part of the SMTP data, containing the email body, including text content and attachments. Based on the number and relevance of the emerging intelligent methods, papers representing each method were identified, read, and summarized. Insightful findings, challenges, and research problems are disclosed in this paper. This comprehensive survey paves the way for future research endeavors addressing theoretical and empirical aspects related to intelligent spam email detection.

Email Spam Detection Using Integrated Approach of Naïve Bayes and Particle Swarm Optimization

Conference Paper

Jun 2018

Nowadays, communication through email has become one of the cheapest and easiest options for official and business users because internet access is readily available. Most people prefer to use email to share important information and to maintain their official records. But there are two sides to every coin, and many people misuse this easy means of communication by sending unwanted and useless bulk emails to others. These unwanted emails are spam emails, which cause problems for ordinary users such as excessive use of mailbox storage and the need to separate useful emails from useless ones. So, there is a need for an autonomous approach that filters out spam emails. In this paper, an integrated approach combining the machine learning based Naive Bayes (NB) algorithm and the computational intelligence based Particle Swarm Optimization (PSO) is used for email spam detection. Here, the Naive Bayes algorithm is used for learning and classifying email content as spam or non-spam. PSO has stochastic distribution and swarm behavior properties and is used for the global optimization of the parameters of the NB approach. For experimentation, the Ling-Spam dataset is used, and performance is evaluated in terms of precision, recall, F-measure, and accuracy. Based on the evaluated results, the integrated PSO approach outperforms the individual NB approach.

An evaluation on the efficiency of hybrid feature selection in spam email classification

Article

Aug 2015

In this paper, a spam filtering technique that implements a combination of two types of feature selection methods in its classification task is discussed. Spam, also known as unwanted messages, constantly floods our electronic mailboxes despite the spam filtering systems provided by email service providers. In addition, the issue of spam is frequently highlighted by Internet users and attracts many researchers to work on fighting it. A number of frameworks, algorithms, toolkits, systems, and applications have been proposed, developed, and applied by researchers and developers to protect us from spam. Several steps need to be considered in the classification task, such as data pre-processing, feature selection, feature extraction, training, and testing. One of the main processes in the classification task is feature selection, which is used to reduce the dimensionality of the word-frequency features without affecting the performance of the classification task. Accordingly, we conducted an experiment to test the efficiency of the proposed Hybrid Feature Selection, which combines Term Frequency Inverse Document Frequency (TFIDF) with rough set theory, on the spam email classification problem. The results show that the proposed Hybrid Feature Selection returns a good result.
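The TFIDF half of the hybrid feature selection described here can be sketched as a term-ranking step. This is only an illustration under stated assumptions: it uses the common weighting tf = count / document length and idf = log(N / df), the helper names `tfidf` and `top_terms` and the toy documents are hypothetical, and the rough set theory component of the paper's method is not implemented.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(docs, k):
    """Keep the k terms with the highest tf-idf weight in any document."""
    best = {}
    for doc_weights in tfidf(docs):
        for term, weight in doc_weights.items():
            best[term] = max(best.get(term, 0.0), weight)
    return sorted(best, key=best.get, reverse=True)[:k]


# Hypothetical, already tokenized messages.
DOCS = [
    ["free", "prize", "now"],
    ["free", "meeting", "now"],
    ["free", "lunch", "meeting"],
]
print(top_terms(DOCS, k=2))
```

A term such as "free" that appears in every document gets an idf of log(1) = 0 and therefore drops out of the ranking, which is how this weighting reduces the dimensionality of the word-frequency features.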

Email Spam Detection : An Empirical Comparative Study of Different ML and Ensemble Classifiers

Conference Paper

Dec 2019

Spam detection in online social networks by deep learning

Conference Paper

Sep 2018

Species Identification Using Partial DNA Sequence: A Machine Learning Approach

Conference Paper

Oct 2018

Text and image based spam email classification using KNN, Naïve Bayes and Reverse DBSCAN algorithm

Conference Paper

Feb 2014

W. Awad, S. Elseuofi. Machine Learning Methods for Spam E-Mail Classification. International Journal of Computer Science & Information Technology.

Diren, D.D., Boran, S., Selvi, I.H., & Hatipoglu, T. (2019). Root Cause Detection with an Ensemble Machine Learning Approach in the Multivariate Manufacturing Process.

Shradhanjali, Prof. Toran Verma. "E-Mail Spam Detection and Classification Using SVM and Feature Extraction," in International Journal Of Advance Research, Ideas and Innovation In Technology, 2017. ISSN: 2454-132X. Impact factor: 4.295.
