Email Spam Detection Using Machine Learning Algorithms

Nikhil Kumar
Computer Science and Engineering Department
Delhi Technological University
New Delhi, India
nikhilkmr445@gmail.com
Sanket Sonowal
Computer Science and Engineering Department
Delhi Technological University
New Delhi, India
sanketsonowal@gmail.com
Nishant
Computer Science and Engineering Department
Delhi Technological University
New Delhi, India
nishantyadav420.ny@gmail.com
Abstract- Email spam has become a major problem nowadays. With
the rapid growth in the number of internet users, email spam is
also increasing. People use it for illegal and unethical
conduct, phishing, and fraud, for example by sending malicious
links through spam emails that can harm our systems and gain
access to them. Creating a fake profile and email account is
very easy for spammers; they pose as genuine people in their
spam emails and target people who are not aware of these
frauds. It is therefore necessary to identify fraudulent spam
mails. This project identifies spam using machine learning
techniques: this paper discusses the machine learning
algorithms, applies each of them to our data sets, and selects
the algorithm with the best precision and accuracy for email
spam detection.
Keywords: Machine learning, Naïve Bayes, support vector
machine, K-nearest neighbor, random forest, bagging, boosting,
neural networks.
I. INTRODUCTION
Email or electronic mail spam refers to the use of email to
send unsolicited emails or advertising emails to a group of
recipients. Unsolicited emails are those the recipient has not
granted permission to receive. The popularity of spam email has
been increasing over the last decade. Spam has become a major
nuisance on the internet: it wastes storage, time, and message
speed. Automatic email filtering may be the most effective
method of detecting spam, but nowadays spammers can easily
bypass many of these spam filtering applications. Several years
ago, most spam coming from certain email addresses could be
blocked manually. In this work, a machine learning approach is
used for spam detection. The major approaches adopted for spam
filtering include “text analysis, white and blacklists of
domain names, and community-based techniques”. Text analysis of
the contents of mails is a widely used method of combating
spam. Many solutions deployable on the server and client sides
are available. Naive Bayes is one of the best-known algorithms
applied in these methods. However, rejecting mails purely on
the basis of content analysis can be a problem in the case of
false positives. Users and organizations generally do not want
any legitimate messages to be lost. The blacklist approach was
one of the earliest techniques used for filtering spam. The
technique is to accept all mails except those from explicitly
blacklisted domains or email addresses. As newer domains keep
entering the category of spamming domain names, this technique
tends to no longer work so well. The whitelist approach accepts
mails from the domain names or addresses that are explicitly
whitelisted and places the others in a lower-priority queue,
from which a mail is delivered only after the sender responds
to a confirmation request sent by the “junk mail filtering
system”.
Spam and Ham: According to Wikipedia, “the use of electronic
messaging systems to send unsolicited bulk messages, especially
mass advertisement, malicious links etc.” is called spam.
“Unsolicited” means messages that you did not ask for from
those sources. So, if you do not know the sender, the mail may
be spam. People generally do not realize that they signed up
for such mailers when they downloaded a free service or
software, or while updating software. The term “ham” was coined
by SpamBayes around 2001 and is defined as “emails that are
generally desired and are not considered spam”.
Fig.1. Classification into Spam and non-spam
Proceedings of the Second International Conference on Inventive Research in Computing Applications (ICIRCA-2020)
IEEE Xplore Part Number: CFP20N67-ART; ISBN: 978-1-7281-5374-2
978-1-7281-5374-2/20/$31.00 ©2020 IEEE 108
Machine learning approaches are more efficient: they use a set
of training data, consisting of emails that are pre-classified.
Machine learning offers many algorithms that can be used for
email filtering, including Naïve Bayes, support vector
machines, neural networks, K-nearest neighbor, random forests,
etc.
II. LITERATURE REVIEW
There is some related work that applies machine learning
methods to email spam detection. A. Karim, S. Azam, B.
Shanmugam, K. Kannoorpatti and M. Alazab [2] present a focused
literature survey of artificial intelligence (AI) and machine
learning methods for email spam detection. K. Agarwal and T.
Kumar [3], Harisinghaney et al. (2014) [4], and Mohamad &
Selamat (2015) [5] have used “image and textual datasets for
e-mail spam detection with the use of various methods”.
Harisinghaney et al. (2014) [4] used the KNN algorithm, Naïve
Bayes, and the Reverse DBSCAN algorithm in experiments on their
dataset. For text recognition an OCR library [3] is employed,
but this OCR does not perform well. Mohamad & Selamat (2015)
[5] use a hybrid feature selection approach combining TF-IDF
(Term Frequency Inverse Document Frequency) and rough set
theory.
A. Data Set
This model uses email data sets from different online sources
such as Kaggle and sklearn, and some data sets were created by
the authors. A spam email data set from Kaggle is used to train
our model, and the other email data sets are used to obtain
results. The “spam.csv” data set contains 5573 lines and 2
columns, and the other data sets contain 574, 1001, and 956
lines of email data in text format.
III. METHODOLOGY
A. Data preprocessing:
When data is considered, one usually pictures very large data
sets with a large number of rows and columns. But this is not
always the case: the data could be in many forms, such as
images, audio and video files, structured tables, etc.
A machine does not understand images, video, or text data as
they are; a machine only understands 1s and 0s.
Steps in Data Preprocessing:
Data cleaning: In this step, work such as “filling in missing
values”, “smoothing noisy data”, “identifying or removing
outliers”, and “resolving inconsistencies” is done.
Data integration: In this step, several databases, information
files, or information sets are combined.
Data transformation: Aggregation and normalization are
performed to scale the data to a specific range.
Data reduction: This step obtains a reduced representation of
the data set that is much smaller in size but still produces
the same analytical result.
1. Stop words:
“Stop words are the English words that do not add much
meaning to a sentence.” They can be safely ignored without
sacrificing the sense of the sentence.
For example, for a search query like “How to make a veg cheese
sandwich”, the search engine will try to find web pages that
contain the terms “how”, “to”, “make”, “a”, “veg”, “cheese”,
“sandwich”. It will tend to find pages that contain the terms
“how”, “to”, “a” rather than pages containing recipes for a veg
cheese sandwich, because the terms “how”, “to”, “a” are so
commonly used in the English language. If these three words are
removed, or stopped, the search engine can focus on retrieving
pages that contain the keywords “veg”, “cheese”, “sandwich”,
which gives the result of interest.
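The stop-word step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name remove_stop_words and the very small STOP_WORDS set are assumptions made for this example, and a real project would normally use a much longer list from a library.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# real lists (for example in NLTK or scikit-learn) are much longer.
STOP_WORDS = {"how", "to", "a", "an", "the", "is", "of", "and"}

def remove_stop_words(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]

print(remove_stop_words("How to make a veg cheese sandwich"))
# ['make', 'veg', 'cheese', 'sandwich']
```

Only “how”, “to” and “a” are dropped from the example query, so the remaining words carry the actual topic of the search.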
2. Tokenization:
“Tokenization is the process of splitting a stream of text
into phrases, symbols, words, or any other meaningful elements,
called tokens.” The list of tokens is then used as input for
further processing, such as text mining and parsing.
Tokenization is useful both in linguistics (where it is a form
of text segmentation) and as part of lexical analysis in
computer science and engineering.
It is sometimes hard to define what is meant by the term
“word”, as tokenization happens at the word level. A tokenizer
often relies on simple heuristics, for instance:
Tokens are separated by whitespace characters, such as “line
break” or “space”, or by “punctuation characters”.
All contiguous strings of alphabetic characters are part of one
token; likewise with numbers.
Whitespace and punctuation may or may not be included in the
resulting list of tokens.
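The heuristics above can be expressed with a regular expression. The sketch below is one possible reading of them, assuming ASCII letters and digits only; the function name tokenize and its keep_punctuation flag are hypothetical and chosen for this example.

```python
import re

def tokenize(text, keep_punctuation=False):
    """Split text into tokens using the simple heuristics described above."""
    if keep_punctuation:
        # Runs of letters, runs of digits, or single punctuation characters.
        pattern = r"[A-Za-z]+|\d+|[^\sA-Za-z\d]"
    else:
        # Only runs of letters or runs of digits; whitespace and
        # punctuation act as separators and are dropped.
        pattern = r"[A-Za-z]+|\d+"
    return re.findall(pattern, text)

print(tokenize("Win $1000 now!!! Click here."))
# ['Win', '1000', 'now', 'Click', 'here']
print(tokenize("Win $1000 now!", keep_punctuation=True))
# ['Win', '$', '1000', 'now', '!']
```

The flag mirrors the last heuristic: whether punctuation appears in the token list is a design choice, and for spam filtering characters such as “$” or “!” may themselves be informative.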
3. Bag of words
“Bag of Words (BOW) is a method of extracting features from
text documents. These features can then be used for training
machine learning algorithms. Bag of Words creates a
vocabulary of all the unique words present in all the documents
in the training dataset.”
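A small sketch may make the idea concrete. It builds the vocabulary from a few made-up training documents and turns a new document into a vector of word counts. The names build_vocabulary and to_count_vector and the sample documents are assumptions for illustration; in practice a library class such as scikit-learn's CountVectorizer plays this role.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every unique word in the training documents to a column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector

# Hypothetical training documents.
train_docs = ["free prize now", "meeting at noon", "free free offer"]
vocab = build_vocabulary(train_docs)
print(vocab)
# {'at': 0, 'free': 1, 'meeting': 2, 'noon': 3, 'now': 4, 'offer': 5, 'prize': 6}
print(to_count_vector("free offer free lunch", vocab))
# [0, 2, 0, 0, 0, 1, 0]
```

Note that “lunch” does not appear in the output vector, because the vocabulary is fixed by the training documents.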
B. CLASSIC CLASSIFIERS
Classification is a form of data analysis that extracts models
describing important data classes. A classifier, or model, is
constructed to predict class labels, for example labelling a
loan application as risky or safe.
Data classification is a two-step process:
- a learning step (construction of the classification model), and
- a classification step.
1. NAÏVE BAYES:
The Naïve Bayes classifier was used in 1998 for spam
recognition. The Naïve Bayes classifier algorithm is used for
supervised learning. The Bayesian classifier works on dependent
events: the probability of an event occurring in the future is
estimated from occurrences of the same event in the past. Naïve
Bayes is based on Bayes' theorem together with the assumption
that features are independent of each other. The Naïve Bayes
classifier technique
can be used for classifying spam emails, as word probability
plays the main role here. If a word occurs often in spam but
not in ham, then an email containing it is likely to be spam.
The Naive Bayes classifier algorithm has become one of the
best-known techniques for email filtering. For it to work
effectively, the model must be trained well using the Naïve
Bayes filter. Naive Bayes calculates the probability of each
class, and the class with the maximum probability is chosen as
the output. Naïve Bayes often provides accurate results and is
used in many fields, such as spam filtering.
(1)
(2)
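As a rough illustration of the classification rule described above, the following sketch implements a small multinomial Naïve Bayes classifier in plain Python. It is not the authors' implementation: the add-one (Laplace) smoothing, the function names train_naive_bayes and predict, and the toy messages are all assumptions made for this example. Each class c is scored by log P(c) plus the sum of log P(w | c) over the words w of the message, which follows from Bayes' theorem under the independence assumption, and the class with the highest score is returned.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(messages, labels):
    """Collect class counts and per-class word counts from tokenized messages."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(messages, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(tokens, model):
    """Return the class with the highest (log) posterior probability."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # log P(class) + sum of log P(word | class), with add-one smoothing.
        score = math.log(count / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data, already tokenized.
messages = [["free", "prize", "click"], ["win", "free", "money"],
            ["meeting", "at", "noon"], ["lunch", "at", "noon"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(messages, labels)
print(predict(["free", "money"], model))     # spam
print(predict(["meeting", "lunch"], model))  # ham
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from forcing a class probability to zero.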
2. SUPPORT VECTOR MACHINE
“The Support Vector Machine (SVM) is a popular supervised
learning algorithm”, used for classification problems in
machine learning. Support Vector Machines are founded on the
idea of decision boundaries. The main purpose of the Support
Vector Machine algorithm is to create the line or decision
boundary that separates the classes. The algorithm outputs a
hyperplane that classifies new samples. In 2-dimensional space,
the “hyperplane is a line dividing a plane into 2 parts where
each class is present on one side.”
Fig.2 Support Vector Machine
3. DECISION TREE
“Decision tree induction is the learning of decision trees from
class-labeled training tuples”. A decision tree is a
flow-chart-like structure, where:
Internal node or non-leaf node = test on an attribute
Branch = outcome of the test
Leaf node = holds a class label
The top node is called the root node.
Fig.3. Decision Tree Structure
Decision tree induction:
The construction of “decision tree classifiers” does not need
“any domain knowledge or parameter setting”, and is therefore
suitable for exploratory knowledge discovery. It can handle
multidimensional data. The learning and classification phases
of decision tree induction are simple and fast. Attribute
selection measures are used to choose the attribute that best
partitions the tuples into distinct classes. When a decision
tree is built, many of the branches may reflect noise and
outliers in the training data. Tree pruning attempts to
identify and remove such branches, with the goal of improving
classifier accuracy on unseen data.
Entropy using the frequency table of one attribute:
(3)
Entropy using the frequency table of two attributes:
(4)
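The two entropy quantities named above can be computed directly from frequency tables. The sketch below assumes the standard definitions, E(S) = -Σ p_i log2 p_i for one attribute and E(T, X) = Σ P(c) E(c) over the values c of attribute X for two attributes; the counts and the attribute name contains_link are invented for illustration.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def conditional_entropy(table):
    """Weighted entropy of the target after splitting on one attribute.

    `table` maps each attribute value to its list of class counts.
    """
    total = sum(sum(counts) for counts in table.values())
    return sum((sum(counts) / total) * entropy(counts) for counts in table.values())

# Hypothetical frequency tables: 9 ham / 5 spam overall, split by a
# made-up attribute "contains_link" with three values.
target_counts = [9, 5]
by_attribute = {"no": [4, 0], "one": [3, 2], "many": [2, 3]}

print(round(entropy(target_counts), 3))            # 0.94
print(round(conditional_entropy(by_attribute), 3)) # 0.694
```

The difference between the two values (about 0.247 bits here) is the information gain of the split, which is one common attribute selection measure for choosing the attribute to test at a node.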
4. K-NEAREST NEIGHBOUR
“K-nearest neighbors is a supervised classification algorithm.
This algorithm has some data points and data vectors that are
separated into several classes, and uses them to predict the
classification of a new sample point.”
K-nearest neighbor is a lazy algorithm: it only memorizes the
training data and does not learn a model from it, deferring all
decisions until a new point has to be classified.
The K-nearest neighbor algorithm classifies a new point based
on a similarity measure, which can be the Euclidean distance.
The algorithm measures the Euclidean distance from the new
point to the known points and identifies which of them are its
neighbors.
$\mathrm{dist}((x, y), (a, b)) = \sqrt{(x - a)^2 + (y - b)^2}$ (5)
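A minimal sketch of the procedure, assuming majority voting among the k nearest points and two invented numeric features per email; the function names euclidean and knn_predict are hypothetical.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(query, points, labels, k=3):
    """Label the query by majority vote among its k nearest training points."""
    nearest = sorted(zip(points, labels),
                     key=lambda pair: euclidean(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D features, e.g. (number of links, number of capitalised words).
points = [(0, 1), (1, 0), (1, 1), (6, 7), (7, 6), (8, 8)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

print(euclidean((0, 0), (3, 4)))                  # 5.0
print(knn_predict((6, 6), points, labels, k=3))   # spam
```

No work happens before a query arrives, which is what makes the algorithm lazy; the cost is that every prediction scans all stored points.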
C. ENSEMBLE LEARNING METHODS
“Ensemble methods in machine learning combine several base
models to produce one predictive model in order to decrease
variance (bagging), decrease bias (boosting), or improve
predictions (stacking).” There are two types: sequential, where
the base classifiers are created sequentially, and parallel,
where the base classifiers are created in parallel.
1. RANDOM FOREST CLASSIFIER
The random forest classifier is an ensemble tree classifier
consisting of decision trees of different shapes and sizes.
It relies on random sampling of the training data when building
each tree, and on a random subset of the input features when
splitting at each node of a tree. This randomization makes the
decision trees less correlated (the trees should not all look
the same), so that the generalization error of the ensemble can
be improved.
2. BAGGING
“A bagging classifier is an ensemble classifier that fits base
classifiers, each on a random subset of the original data set,
and then combines their individual predictions (by voting or by
averaging) to form a final prediction.” Bagging is a
combination of bootstrapping and aggregating.
Bagging = Bootstrap AGGregatING
Bootstrapping helps to lessen the variance of the classifier,
and it also reduces overfitting, by resampling data from the
training set with the same cardinality as the original data
set. High variance is not good for the model. Bagging is a very
effective method for limited data: by using only samples, an
estimate can be obtained by aggregating the scores.
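The two halves of bagging, bootstrapping and aggregating, can be sketched as follows. The base learner here is a deliberately simple threshold rule invented for this example (train_threshold_stump), and the one-dimensional data are made up; the paper's experiments use library classifiers instead.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (same cardinality as the original)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate the base classifiers' predictions by voting."""
    return Counter(predictions).most_common(1)[0][0]

def train_threshold_stump(sample):
    """Hypothetical base learner: threshold halfway between the class means."""
    spam = [x for x, label in sample if label == "spam"]
    ham = [x for x, label in sample if label == "ham"]
    if not spam or not ham:  # degenerate bootstrap sample with one class only
        only = "spam" if spam else "ham"
        return lambda x: only
    threshold = (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2
    return lambda x: "spam" if x > threshold else "ham"

# Hypothetical 1-D data: (number of suspicious words, label).
data = [(0, "ham"), (1, "ham"), (2, "ham"), (7, "spam"), (8, "spam"), (9, "spam")]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
models = [train_threshold_stump(bootstrap_sample(data, rng)) for _ in range(11)]

def bagged_predict(x):
    """Final prediction: majority vote over all base classifiers."""
    return majority_vote([model(x) for model in models])

print(bagged_predict(1), bagged_predict(8))
```

Each base classifier sees a slightly different resample, so individual thresholds vary, but the vote over an odd number of models smooths that variation out.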
3. BOOSTING AND ADABOOST CLASSIFIER
“Boosting is an ensemble method that is used to create a strong
classifier from a number of weak classifiers. Boosting is done
by building a model from the training data set, then creating
another model that corrects the faults of the first model.” [8]
In boosting, models are added until the training set is
predicted properly.
AdaBoost = Adaptive Boosting
AdaBoost was the first successful boosting algorithm developed
for binary classification. Boosting is best understood by
studying AdaBoost.
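How a later model comes to focus on the faults of an earlier one can be seen in a single reweighting round. The sketch below uses the standard discrete AdaBoost update, which is an assumption here since the paper does not state its formulas; the function name adaboost_round and the four-sample example are hypothetical.

```python
import math

def adaboost_round(weights, correct):
    """Perform one AdaBoost reweighting step.

    weights: current sample weights (summing to 1).
    correct: list of booleans, True where the weak classifier was right.
    Returns the classifier's vote weight (alpha) and the updated sample weights.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Assumes 0 < error < 1; a perfect or always-wrong classifier
    # needs special handling that is omitted from this sketch.
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]   # the weak classifier misses the last sample
alpha, new_weights = adaboost_round(weights, correct)
print(round(alpha, 3))                      # 0.549
print([round(w, 3) for w in new_weights])   # [0.167, 0.167, 0.167, 0.5]
```

After the update the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.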
IV. ALGORITHMS
1.1. Insert the dataset or file for training or testing.
1.2. Check the dataset for supported encoding.
1.2.1. If it is one of the supported encodings, then go to
step 1.4.
1.2.2. If it is not one of the supported encodings, then go
to step 1.3.
1.3. Change the encoding of the inserted file into one of
the supported encodings. Then try reading it again.
1.4. Select whether you want to “Train”, “Test” or
“Compare” the models using the dataset.
1.4.1. If “Train” is selected, then go to step 1.5.
1.4.2. If “Test” is selected, then go to step 1.6.
1.4.3. If “Compare” is selected, then go to step 1.7.
1.5. “Train” selected:
1.5.1. Select which classifier to train using the
inserted dataset.
1.5.2. Check for duplicates and NAN values.
1.5.3. Find the values from Hyperparameter Tuning.
1.5.4. Process the text for feature transform.
1.5.5. Train the model
1.5.6. Save the model and features. Show the results.
1.6. “Test” selected:
1.6.1. Select which classifier to test using the inserted
dataset.
1.6.2. Check for duplicates and NAN values.
1.6.3. Load the model and features saved in the
training phase of the model.
1.6.4. Use the loaded values for testing the
dataset.
1.6.5. Show the results.
1.7. “Compare” selected:
1.7.1. Compare all the classifiers using the inserted
dataset.
1.7.2. Show the results of the classifiers.
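The encoding check in steps 1.2 and 1.3 might look like the sketch below. The list of supported encodings is an assumption (the paper does not name them), and instead of converting the file the sketch simply falls back to the next encoding when decoding fails, which is a simplification of step 1.3. The function name decode_with_fallback is hypothetical.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "latin-1")):
    """Try each supported encoding in turn and return (text, encoding used)."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit; try the next one
    raise ValueError("file is not in any of the supported encodings")

# Bytes that are valid Latin-1 but not valid UTF-8.
raw = "café,ham".encode("latin-1")
text, used = decode_with_fallback(raw)
print(used, text)   # latin-1 café,ham
```

Because Latin-1 can decode any byte sequence, placing it last makes it a catch-all; a stricter list of encodings would reach the ValueError branch instead.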
A. Implementation
The Visual Studio Code platform is used to implement the model.
In this module, a dataset from the “Kaggle” website is used
as the training dataset. The inserted dataset is first checked
for duplicates and null values for better performance of the
machine. Then the dataset is split into 2 sub-datasets, a
“train dataset” and a “test dataset”, in the proportion 70:30.
The “train” and “test” datasets are then passed as parameters
for text processing. In text processing, punctuation symbols
and words that are in the stop words list are removed, and the
remaining clean words are returned. These clean words are then
passed to “Feature Transform”, where they are used for ‘fit’
and ‘transform’ to create a vocabulary for the machine. The
dataset is also passed to “hyperparameter tuning” to find
optimal values for the classifier to use according to the
dataset.
After acquiring the values from “hyperparameter tuning”, the
machine is fitted using those values with a random state. The
state of the trained model and the features are saved for
future use in testing unseen data.
Using classifiers from the sklearn module in Python, the
machines are trained using the values obtained above.
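The first two implementation steps, removing duplicates and null values and then splitting 70:30, can be sketched in plain Python as below. The row format, the seed, and the function names are assumptions for illustration; the paper's implementation relies on sklearn, whose exact calls are not given in the text.

```python
import random

def drop_duplicates_and_nulls(rows):
    """Remove rows with a missing field and exact duplicate rows, keeping order."""
    seen, cleaned = set(), []
    for row in rows:
        if any(value is None or value == "" for value in row) or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned

def train_test_split_70_30(rows, seed=42):
    """Shuffle the rows reproducibly and split them 70:30."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (label, text) rows standing in for the contents of spam.csv.
rows = [("ham", f"message {i}") for i in range(8)] + [
    ("spam", "free prize"), ("spam", "free prize"),
    ("spam", None), ("spam", "win now")]
clean = drop_duplicates_and_nulls(rows)
train, test = train_test_split_70_30(clean)
print(len(clean), len(train), len(test))   # 10 7 3
```

Fixing the seed plays the same role as the random state mentioned above: repeated runs produce the same split, so scores from different classifiers remain comparable.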
B. FlowChart of the model
Fig.4. Flow Chart of Model
V. RESULT
Our model has been trained using multiple classifiers to check
and compare the results for greater accuracy. Each classifier
gives its evaluated results to the user. After all the
classifiers have returned their results, the user can compare
them with one another to see whether the data is “spam” or
“ham”. Each classifier's result is shown in graphs and tables
for better understanding. The dataset used for training is
obtained from the “Kaggle” website and is named “spam.csv”. To
test the trained machine, a different CSV file named
“emails.csv” was developed with unseen data, i.e. data which is
not used for the training of the machine.
TABLE I. COMPARISON TABLE
No.  Classifier                 Score 1  Score 2  Score 3  Score 4
1    Support Vector Classifier  0.81     0.92     0.95     0.92
2    K-Nearest Neighbour        0.92     0.88     0.87     0.88
3    Naïve Bayes                0.87     0.98     0.98     0.98
4    Decision Tree              0.94     0.95     0.93     0.95
5    Random Forest              0.90     0.92     0.92     0.92
6    AdaBoost Classifier        0.95     0.94     0.95     0.94
7    Bagging Classifier         0.94     0.94     0.95     0.94
a. Score 1: using default parameters
b. Score 2: using hyperparameter tuning
c. Score 3: using stemmer and hyperparameter tuning
d. Score 4: using length, stemmer and hyperparameter tuning
Fig.5 Comparison of all algorithms
Fig.6. Comparison Graph
VI. CONCLUSION
With this result, it can be concluded that Multinomial Naïve
Bayes gives the best outcome but has a limitation due to the
class-conditional independence assumption, which makes the
machine misclassify some tuples. Ensemble methods, on the other
hand, proved to be useful as they use multiple classifiers for
class prediction. Nowadays, lots of emails are sent and
received, which makes the task difficult, as our project is
only able to test emails using a limited corpus. Our spam
detection project is capable of filtering mails according to
the content of the email, and not according to domain names or
any other criteria. It is therefore, at present, limited to the
body of the email.
There is wide scope for improvement in our project. The
following improvements can be made:
Filtering of spam can be done on the basis of trusted and
verified domain names.
Spam email classification is very significant in categorizing
emails and distinguishing emails that are spam from those that
are non-spam.
This method can be used by large organizations to single out
decent mails, that is, only the emails they wish to receive.
REFERENCES
1. Suryawanshi, Shubhangi & Goswami, Anurag & Patil, Pramod.
(2019). Email Spam Detection: An Empirical Comparative Study of
Different ML and Ensemble Classifiers. 69-74.
10.1109/IACC48062.2019.8971582.
2. Karim, A., Azam, S., Shanmugam, B., Krishnan, K., & Alazab,
M. (2019). A Comprehensive Survey for Intelligent Spam Email
Detection. IEEE Access, 7, 168261-168295.
[08907831]. https://doi.org/10.1109/ACCESS.2019.2954791
3. K. Agarwal and T. Kumar, "Email Spam Detection Using
Integrated Approach of Naïve Bayes and Particle Swarm
Optimization," 2018 Second International Conference on Intelligent
Computing and Control Systems (ICICCS), Madurai, India, 2018,
pp. 685-690.
4. Harisinghaney, Anirudh, Aman Dixit, Saurabh Gupta, and Anuja
Arora. "Text and image-based spam email classification using
KNN, Naïve Bayes and Reverse DBSCAN algorithm." In
Optimization, Reliabilty, and Information Technology (ICROIT),
2014 International Conference on, pp. 153-155. IEEE, 2014.
5. Mohamad, Masurah, and Ali Selamat. "An evaluation on the
efficiency of hybrid feature selection in spam email classification."
In Computer, Communications, and Control Technology (I4CT),
2015 International Conference on, pp. 227-231. IEEE, 2015.
6. Shradhanjali, Prof. Toran Verma, "E-Mail Spam Detection and
Classification Using SVM and Feature Extraction," International
Journal of Advance Research, Ideas and Innovation in
Technology, 2017. ISSN: 2454-132X. Impact factor: 4.295.
7. W.A, Awad & S.M, ELseuofi. (2011). Machine Learning Methods
for Spam E-Mail Classification. International Journal of Computer
Science & Information Technology. 3. 10.5121/ijcsit.2011.3112.
8. A. K. Ameen and B. Kaya, "Spam detection in online social
networks by deep learning," 2018 International Conference on
Artificial Intelligence and Data Processing (IDAP), Malatya,
Turkey, 2018, pp. 1-4.
9. Diren, D.D., Boran, S., Selvi, I.H., & Hatipoglu, T. (2019). Root
Cause Detection with an Ensemble Machine Learning Approach in
the Multivariate Manufacturing Process.
10. Tasnim Kabir, Abida Sanjana Shemonti, Atif Hasan Rahman.
"Notice of Violation of IEEE Publication Principles: Species
Identification Using Partial DNA Sequence: A Machine Learning
Approach", 2018 IEEE 18th International Conference on
Bioinformatics and Bioengineering (BIBE), 2018.
... Future work could explore ELCADP's application to phishing classification, a domain prone to this drift. [29] investigates ensemble methods for spam detection using a multinomial Naive Bayes baseline. Trained on the Kaggle "spam.csv" ...
Preprint
Full-text available
Phishing emails continue to pose a significant threat, causing financial losses and security breaches. This study addresses limitations in existing research, such as reliance on proprietary datasets and lack of real-world application, by proposing a high-performance machine learning model for email classification. Utilizing a comprehensive and largest available public dataset, the model achieves a f1 score of 0.99 and is designed for deployment within relevant applications. Additionally, Explainable AI (XAI) is integrated to enhance user trust. This research offers a practical and highly accurate solution, contributing to the fight against phishing by empowering users with a real-time web-based application for phishing email detection.
... In their study, Kumar, Sonowal, and Nishant (2020) investigated the detection of email spam by employing various machine learning methods, such as Naïve Bayes and support vector machines. Their research adds to the wide array of strategies that may be utilized to combat spam [16]. ...
Chapter
Full-text available
Email continues to be a crucial means of communication in the digital age, playing an essential role in personal, intellectual, and professional interactions. Nevertheless, the widespread use of email has also made it a primary objective for unsolicited messages, varying from harmless adverts to harmful phishing schemes, posing considerable difficulties in terms of safeguarding, confidentiality, and efficiency. This study aims to address the urgent requirement for accurate spam detection by conducting a thorough comparative examination of machine learning (ML) models. The objective is to determine the best effective strategies for separating spam emails from legitimate ones. Using a dataset of labeled emails, we explore the process of preparing and analyzing text data to discover important features that are crucial for categorization. The scope of our analysis includes a broad range of machine learning classifiers, such as Support Vector Machines, Naive Bayes, Decision Trees, as well as advanced ensemble approaches like Random Forest and Gradient Boosting. In addition, we investigate the influence of feature engineering, specifically the effects of text length and word count, on improving the performance of the models. The empirical findings demonstrate that ensemble approaches, particularly the Extra Trees Classifier, attain exceptional accuracy, underscoring its superiority in terms of generalization and spam detection capacities. This research not only enhances the scholarly comprehension of spam detection mechanisms but also provides valuable insights for the practical implementation of more efficient and robust spam filtering systems, representing a crucial advancement in securing digital communication platforms against the widespread menace of spam.
... It emphasizes its versatility across different domains and provides an implementation. By testing on a sample dataset, the study ensures the correctness of probabilistic computations [4]. The paper addresses the pressing issue of email spam, which poses risks such as phishing and fraud. ...
Article
The abundance of unwanted spam messages complicates the use of Short Message Service (SMS) for efficient communication in modern times. This study investigates developing and utilizing a Naive Bayes Theorem-based Ham/Spam detection system. Because of its ease of use and effectiveness in text classification tasks, the Naive Bayes classifier is used. A collection of SMS messages labeled as” spam” or” ham” (non-spam) makes up the dataset that was used for testing and training. Preprocessing methods, including tokenization, stop-word elimination, and stemming, are employed to extract pertinent features from the text messages. The Naive Bayes classifier learns how words relate to whether they’re in a spam or non-spam message by looking at some examples from the dataset. Utilizing criteria such as accuracy, precision, and confusion matrix on a separate testing set, the classifier’s performance is evaluated. Additionally, the impact of varying parameters such as smoothing techniques and feature selection methods on the classifier’s performance is analyzed. The experimental results used to distinguishing between ham and spam messages in SMS communication
... Nikhil kumar [7] published a paper under the title 'Email Spam Detection Using Machine Learning Algorithms'. The paper discusses email spam and the application of machine learning algorithms to detect and filter spam emails. ...
... In [12], The paper discusses the detection of spam emails using various algorithms of ML. Naïve Bayes, SVM, Decision Tree and kNN classifiers were applied to the dataset. ...
Article
Full-text available
Email is the most cost-effective way to communicate with people across the world. It offers a simple and convenient way to send and receive messages. However, it is susceptible to various types of threats. The most significant risk to emailers is spam, which refers to unsolicited emails that are often sent in large numbers to multiple recipients. Malicious spam includes links to phishing websites. These links not only pose a threat to our system, but also make our personal information accessible to hackers. Spammers often create fake profiles and email accounts making it easier for them to deceive unsuspecting victims. In this paper, the content of the email is used to detect spam which provides more information than a URL. It has been found that several techniques have been proposed to efficiently identify spam emails but current email spam detection methods have not yet achieved high accuracy. It is necessary to improve their performance in spam detection, a generalized two-level ensemble technique is explored irrespective of the type of spam datasets. The proposed two-level ensemble algorithm compares with various machine learning techniques such as SVM, Regression, and kNN, along with their variants, Finally, a K-fold and voting algorithms are applied to make the final prediction.
Article
Communication plays a major part in everything, whether professional or personal. Because of its widespread use, accessibility, affordability, and free services, email is a popular communication tool. The rise in email-based attacks is a direct result of weaknesses in email protocols as well as the growing volume of electronic commerce and financial activity. One of the main issues with today's Internet is email spam, which can financially harm businesses and bother individual consumers. Spammers find it simple to send spam emails, and our inboxes are flooded with them. We receive an overwhelming volume of spam emails every day, making it difficult and time-consuming to distinguish them from legitimate mail. Spam remains a problem despite all the efforts made to eradicate it. Furthermore, when countermeasures become excessively sensitive, even valid emails are filtered out. Filtering is one of the key strategies among the methods created to prevent spam. This research aims to explore machine learning algorithms and their application to our datasets. The best algorithm for email spam detection is chosen based on its precision and accuracy.
Article
Phishing is a criminal scheme to steal a user's personal data and other credential information. It is a fraud that acquires a victim's confidential information, such as passwords, bank account details, credit card numbers, and financial usernames and passwords, which can later be misused by the attacker. The use of machine learning algorithms in phishing detection has gained significant attention in recent years. This research paper aims to evaluate the effectiveness of various machine learning algorithms in detecting phishing URLs/websites. The algorithms tested in this study are Decision Tree, Random Forest, Multilayer Perceptron, XGBoost, Autoencoder Neural Network, and Support Vector Machines. A dataset of phishing URLs is used to train and test the algorithms, and their performance is evaluated based on metrics such as accuracy, precision, recall, and F1 score. The paper takes phishing URL data from Phishtank and legitimate URL data from the University of New Brunswick. The results of this study demonstrate that the Random Forest and XGBoost algorithms outperform the other algorithms in terms of accuracy and other performance metrics, and that the system has an overall accuracy of 98%.
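For reference, the four metrics named in this abstract can be computed from the confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; the function name `binary_metrics` and the example labels are assumptions for illustration and are unrelated to the cited study's data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented example: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this invented example there are three true positives, one false positive, one false negative, and three true negatives, so all four metrics evaluate to 0.75.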
Article
The rapidly growing problem of phishing e-mail, also known as spam and including spear phishing and spam-borne malware, has created a need for reliable, intelligent anti-spam e-mail filters. This survey paper describes a focused literature survey of Artificial Intelligence (AI) and Machine Learning (ML) methods for intelligent spam email detection, which we believe can help in developing appropriate countermeasures. In this paper, we considered four parts of the email's structure that can be used for intelligent analysis: (A) the headers, which provide routing information and contain mail transfer agent (MTA) records giving the email and IP address of each sender and recipient, where the email originated, what stopovers it made, and its final destination; (B) the SMTP envelope, containing the mail exchangers' identification and the originating source and destination domains/users; (C) the first part of the SMTP data, containing information such as from, to, date, and subject, as shown in most email clients; (D) the second part of the SMTP data, containing the email body, including text content and attachments. Based on the number and relevance of emerging intelligent methods, papers representing each method were identified, read, and summarized. Insightful findings, challenges and research problems are disclosed in this paper. This comprehensive survey paves the way for future research endeavors addressing theoretical and empirical aspects related to intelligent spam email detection.
Conference Paper
Nowadays, communication through email has become one of the cheapest and easiest options for official and business users, thanks to the easy availability of internet access. Most people prefer to use email to share important information and to maintain their official records. But like the two sides of a coin, many people misuse this easy way of communication by sending unwanted and useless bulk emails to others. These unwanted emails are spam emails, which cause problems for normal users such as excessive usage of mailbox memory and the need to separate useful emails from unwanted ones. So, there is a need for an autonomous approach that filters out the excess of spam emails. In this paper, an integrated approach combining the machine learning based Naive Bayes (NB) algorithm and the computational intelligence based Particle Swarm Optimization (PSO) is used for email spam detection. Here, the Naive Bayes algorithm is used for learning and for classifying email content as spam or non-spam. PSO has stochastic distribution and swarm behavior properties and is used for the global optimization of the parameters of the NB approach. For experimentation, the Ling spam dataset is considered, and performance is evaluated in terms of precision, recall, F-measure and accuracy. Based on the evaluated results, the PSO-optimized approach outperforms the individual NB approach.
Article
This paper discusses a spam filtering technique that implements a combination of two types of feature selection methods in its classification task. Spam, also known as unwanted messages, constantly floods our electronic mailboxes despite the spam filtering systems provided by email service providers. In addition, the issue of spam is frequently highlighted by Internet users and attracts many researchers to work on fighting spam. A number of frameworks, algorithms, toolkits, systems and applications have been proposed, developed and applied by researchers and developers to protect us from spam. Several steps need to be considered in the classification task, such as data pre-processing, feature selection, feature extraction, training and testing. One of the main processes in the classification task is feature selection, which is used to reduce the dimensionality of the word frequency features without affecting the performance of the classification task. Accordingly, we conducted an experiment to test the efficiency of the proposed Hybrid Feature Selection, which combines Term Frequency Inverse Document Frequency (TFIDF) with rough set theory, on the spam email classification problem. The results show that the proposed Hybrid Feature Selection performs well.
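To make the TFIDF half of this hybrid concrete, the sketch below computes one common variant of the weights: term frequency normalized by document length, multiplied by the natural logarithm of the inverse document frequency. This variant, the function name `tf_idf`, and the toy documents are assumptions for illustration; the cited paper may use a different formulation, and the rough set step is not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs: list of token lists.
    Returns one {term: weight} dict per document, where
    weight = (count / document length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy tokenized messages (invented for this sketch).
docs = [
    ["free", "prize", "free"],
    ["meeting", "at", "noon"],
    ["free", "lunch"],
]
weights = tf_idf(docs)
```

Under this variant, a term that occurs in every document receives a weight of zero, so a feature selection step could drop low-weight terms to reduce dimensionality.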
Machine Learning Methods for Spam E-Mail Classification. International Journal of Computer Science & Information Technology
  • W Awad
  • S Elseuofi
Root Cause Detection with an Ensemble Machine Learning Approach in the Multivariate Manufacturing Process
  • D D Diren
  • S Boran
  • I H Selvi
  • T Hatipoglu
Diren, D.D., Boran, S., Selvi, I.H., & Hatipoglu, T. (2019). Root Cause Detection with an Ensemble Machine Learning Approach in the Multivariate Manufacturing Process.
E-Mail Spam Detection and Classification Using SVM and Feature Extraction
  • Shradhanjali
  • Toran Verma
Shradhanjali, Prof. Toran Verma, "E-Mail Spam Detection and Classification Using SVM and Feature Extraction", International Journal of Advance Research, Ideas and Innovation in Technology, 2017. ISSN: 2454-132X. Impact factor: 4.295.