Evaluation of Machine Learning Algorithms for Intrusion Detection System in WSN

(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 12, No. 5, 2021
Mohammed S. Alsahli (1), Marwah M. Almasri (2), Mousa Al-Akhras (3), Abdulaziz I. Al-Issa (4), Mohammed Alawairdhi (5)
(1, 2, 3, 4, 5) College of Computing and Informatics, Saudi Electronic University, Riyadh 11673, KSA
(3) King Abdullah II School of Information Technology, The University of Jordan, Amman 11942, Jordan
Abstract: Technology has evolved to connect ‘things’ together with the rebirth of the global network called the Internet of Things (IoT). This is achieved through Wireless Sensor Networks (WSN), which introduce new security challenges for Information Technology (IT) scientists and researchers. This paper addresses the security issues in WSN by establishing potential automated solutions for identifying associated risks. It also evaluates the effectiveness of various machine learning algorithms on two datasets, namely the KDD99 and WSN datasets. The aim is to analyze and protect WSN networks in combination with firewalls, Deep Packet Inspection (DPI), and Intrusion Prevention Systems (IPS), all specialized for the overall protection of WSN networks. Multiple testing options were investigated, such as cross-validation and percentage split. Based on the findings, the most accurate algorithm and the one with the least processing time were suggested for both datasets.
Keywords—Internet of Things (IoT); Wireless Sensor Network
(WSN); Information Technology (IT); Denial of Service (DoS);
Artificial Intelligence (AI); Machine Learning (ML)
I. INTRODUCTION
With the rapid expansion of technology, new threats and security issues arise, which have become a hot area for research. A Wireless Sensor Network (WSN) is composed of distributed wireless sensor nodes that collect raw data from the surrounding environment. Each sensor node is equipped with a radio transceiver, a small microcontroller, and a power source [1]. These nodes are very small, low-cost, and designed for low energy consumption, so they offer limited processing power and limited communication, as represented in Fig. 1. Due to the sensors’ limitations in memory, processing power, and energy, several inherent security challenges exist and should be properly addressed. The primary challenge is to protect the WSN without the availability of massive processing power and energy. Traditional security measures such as encryption are difficult to implement at the sensor level due to its limited processing capabilities.
With the increasing number and sophistication of attacks on networks and applications, it is difficult to protect them manually or with common off-the-shelf software such as firewalls, antivirus, Intrusion Detection Systems (IDS), or Intrusion Prevention Systems (IPS). This makes artificial intelligence (AI) and machine learning (ML) algorithms popular and ultimately essential in such scenarios. AI in general, and ML in particular, can be used to protect WSN by identifying and classifying potential attacks based on previously detected attack patterns.
Machine learning has become more popular in recent years. It enables machines or computers to work and react similarly to humans. These systems improve with experience by learning the expected behavior. AI can be applied in many applications such as natural language processing (NLP) and generation, speech recognition, virtual agents, machine learning, deep learning, biometrics, robotic process automation, and text analytics, as well as in many domains such as healthcare, business, education, autonomous vehicles, robotics, government, and public safety and security. Moreover, AI is very useful in predictive analysis and plays a fundamental role in the software field and content creation.
This paper investigates different datasets with different
machine learning algorithms, namely Naïve Bayes, improved
Naïve Bayes, IBK, and Random Forest algorithms in multiple
scenarios. The purpose is to identify the best method to
mitigate the risks, threats, and security vulnerabilities
associated with WSN networks.
The rest of this paper is organized as follows. Section II
discusses related work. Section III presents the underlying
concepts and proposed methodology. Section IV shows the
experimental results. Section V discusses and analyzes the
findings. Finally, section VI concludes the paper.
Fig. 1. WSN Mechanism. [Sensor node block diagram; labels: Transceiver, Micro-controller, ADS, Sensor 1, Sensor 2, External, Power Source.]
II. RELATED WORK
This section presents some research on various attacks in WSN. In [2], the authors addressed Denial of Service (DoS) cyber-attacks on Wireless Sensor Networks (WSN) and how to mitigate them. The researchers used a specialized WSN dataset constructed for classifying attack types. Four DoS attacks were considered: Flooding, Blackhole, Scheduling, and Grayhole attacks. The main purpose was to help WSN manufacturers develop a system that detects and protects against DoS attacks in WSN. They also discussed the challenges of protecting these networks due to WSN limitations such as low processing power, low energy, and limited storage. They emphasized the importance of mitigating and protecting against new and unprecedented attacks [2].
Moreover, the authors in [3] focused on improving the classification accuracy of the Naïve Bayes algorithm by finding more accurate probability estimates. This helps to compensate for a lack of training data. Their approach was applied during the training phase without increasing the classification time. The first phase builds the classical Naïve Bayes classifier, which is then fine-tuned in the second phase. Each training instance is classified, and if it is misclassified, it contributes to fine-tuning the probability values so that it can be classified correctly in the next round. The results showed improved classification accuracy on many datasets.
Many studies have defined the Wireless Sensor Network (WSN). It is typically composed of sensor nodes that gather data about the environment and send it back to the sink or base station node. These data can take different forms, such as thermal, acoustic, optical, weather, pressure, and chemical readings. It is extremely challenging to develop an algorithm that suits many application scenarios in a diverse WSN environment, especially when considering data reliability and aggregation, localization, clustering, fault detection, and security [4].
Furthermore, the authors have highlighted the importance
of utilizing ML in WSN for the following reasons [4]:
1) Using ML techniques could help in observing dynamic
environments.
2) In some cases, WSN gathers new data in out-of-reach
or threatening locations.
3) Accurate models are hard to obtain in WSN since they are
usually applied in sophisticated environments.
4) Using ML techniques could be beneficial in extracting
essential correlations.
The authors in [5] emphasized the growing number of services provided to people, which makes WSN valuable in many applications such as security systems, fire safety, various military applications, environmental monitoring, and health monitoring. However, WSNs have weaknesses because the nodes are exposed to various security attacks due to their limitations in power, processing, memory, and bandwidth, their data transmission via other nodes over multiple hops, and their distributed, self-organizing nature. These attacks occur at different layers of the OSI model. Therefore, it is important to build a security defense and monitoring system to protect against them [5].
Similarly, the authors in [6] discussed WSNs and their crucial role in different applications, the vulnerabilities of WSNs due to their constrained resources, and how DoS attacks can be carried out at different layers of the network architecture. The authors focused specifically on the network layer because of the diversity of attacks at this layer. They reviewed many studies that apply machine learning techniques to network-layer DoS attacks in WSN [6].
IDSs and their important role in protecting against malicious attacks that affect network performance were addressed in [7]. The authors described Mobile Ad hoc Networks (MANETs), WSN, and the Internet of Things (IoT), as well as the significance of IDSs and the need to protect such networks. They proposed an IDS that has two stages. In the first stage, data is collected using sniffers to generate correctly classified instances; in the second stage, a super node processes data from different IDSs to differentiate benign from malicious nodes [7].
III. UNDERLYING CONCEPT AND METHODOLOGY
This section presents the dataset types as well as the used
machine learning techniques.
A. Datasets
A dataset is a collection of records that is gathered in a
controlled lab environment. In this paper, two different datasets
were used. The first dataset is called “KDDCup99 Dataset”
which was derived from the DARPA 1998 dataset [8], [9]. It
was selected and used to detect network breaches from a
network security perspective. A network breach is the abuse of
data and information to bypass the security rules and
established regulations.
The authors in [10] explained that intrusion detection is a set of strategies and related activities that enable the identification of security violations. This dataset was provided through the archive of a data mining competition held in conjunction with KDD-99.
The author in [11] indicates that the features were designed to build a model that distinguishes bad connections (attacks) from normal connections. The complete list of features defined for the connection records is given in Table I.
TABLE I. DESCRIPTION OF KDDCUP99 DATASET FEATURES
Feature Name Description
1. Duration Number of seconds of the connection
2. protocol_type Type of the protocol, e.g., TCP, UDP, etc.
3. Service Network service on the destination, e.g., http, telnet, etc.
4. Flag Normal or error status of the connection
5. src_bytes Number of data bytes from source to destination
6. dst_bytes Number of data bytes from destination to source
7. Land 1-connection is from/to the same host/port; 0-otherwise
8. wrong_fragment Number of ‘wrong’ fragments
9. Urgent Number of urgent packets
10. Hot The count of access to system directories, creation and execution of programs
11. num_failed_logins Number of failed login attempts
12. logged_in 1 - successfully logged in; 0 otherwise
13. num_compromised Number of “compromised” conditions
14. root_shell 1 - root shell is obtained; 0 otherwise
15. su_attempted 1 ‘su root’ command attempted; 0 – otherwise
16. num_root number of ‘root’ accesses
17. num_file_creations Number of file creation operations
18. num_shells Number of shell prompts
19. num_access_files Number of write, delete, and create operations on access control files
20. num_outbound_cmds Number of outbound commands in an ftp session
21. is_hot_login 1 - the login belongs to the ‘hot’ list (e.g., root, adm, etc.) ; 0 – otherwise
22. is_guest_login 1 - the login is a ‘guest’ login (e.g., guest, anonymous, etc.) ; 0 – otherwise
23. Count Number of connections to the same host as the current connection in the past 2 seconds
24. srv_count Number of connections to the same service as the current connection in the past 2 seconds
25. serror_rate % of connections that have ‘SYN’ errors to the same host
26. srv_serror_rate % of connections that have ‘SYN’ errors to the same service
27. rerror_rate % of connections that have ‘REJ’ errors to the same host
28. srv_rerror_rate % of connections that have ‘REJ’ errors to the same service
29. same_srv_rate % of connections to the same service and to the same host
30. diff_srv_rate % of connections to different services and to the same host
31. srv_diff_host_rate % of connections to the same service and to different hosts
32. dst_host_count Number of connections to the same host to the destination host as the current connection in the past 2 seconds
33. dst_host_srv_count Number of connections from the same service to the destination host as the current connection in the past 2 seconds
34. dst_host_same_srv_rate % of connections from the same service to the destination host
35. dst_host_diff_srv_rate % of connections from the different services to the destination host
36. dst_host_same_src_port_rate % of connections from the port services to the destination host
37. dst_host_srv_diff_host_rate % of connections from the different hosts from the same service to destination host
38. dst_host_serror_rate % of connections that have ‘SYN’ errors to same host to the destination host
39. dst_host_srv_serror_rate % of connections that have ‘SYN’ errors from same service to the destination host
40. dst_host_rerror_rate % of connections that have ‘REJ’ errors from the same host to the destination host
41. dst_host_srv_rerror_rate % of connections that have ‘REJ’ errors from the same service to the destination host
The second dataset is called the “WSN Dataset” (WSN-DS) [12], which is specialized for WSN. It is used to detect different types of DoS attacks as well as normal behavior. The dataset is collected with different features and divided into five classes: Blackhole, Grayhole, Scheduling, Flooding, and Normal. Low-Energy Adaptive Clustering Hierarchy (LEACH) is the routing protocol used to collect the dataset, which contains hundreds of thousands of records. LEACH is designed to keep energy consumption low, which is very important for maintaining and improving the lifetime of a WSN [13]. The limitation of LEACH is that it is only suitable for a small WSN [13]. It assumes that all sensors can communicate with each other and with the sink (base station), as shown in Fig. 2. Table II lists the WSN dataset attributes.
B. Machine Learning Techniques
Machine learning techniques are broadly categorized as unsupervised and supervised learning, which are used for clustering and for classification/regression, respectively, as depicted in Fig. 3. Classification assigns the instances of a dataset to categories using algorithms such as Naïve Bayes and IBK. Regression is commonly used as a statistical tool to predict potential outcomes. The following subsections describe the machine learning algorithms that were applied to the above-mentioned datasets.
1) Naïve Bayes: Naïve Bayes (NB) is a machine learning algorithm based on the calculation of probabilities, which uses datasets (raw data or simple facts) to learn a concept. NB is used in a wide range of real applications and automated decision-making processes. A Naïve Bayes classifier uses Bayes’ theorem to classify objects. It is also known as simple Bayes or independence Bayes. These classifiers make a strong (naïve) independence assumption between the attributes of a data point.
Fig. 2. WSN Network.
TABLE II. WSN-DS DATASET ATTRIBUTES [12]
# Attribute Name Attribute Description
1 Node ID A unique ID to distinguish the sensor node in any round and at any stage
2 Time The current simulation time of the node
3 Is CH A flag to distinguish whether the node is CH or not
4 Who CH The ID of the CH in the current round
5 Distance to CH The distance between the node and its CH
6 Energy Consumption The amount of energy consumed in the previous round
7 ADV_CH send The number of advertise CH’s broadcast messages sent to the nodes
8 ADV_CH receives The number of advertise CH messages received from CHs
9 Join_REQ send The number of join request messages sent by the nodes to the CH
10 Join_REQ receives The number of join request messages received by the CH from the nodes
11 ADV_SCH send The number of advertise TDMA schedule broadcast messages sent to the nodes
12 ADV_SCH receives The number of TDMA schedule messages received from CHs
13 Rank The order of this node within the TDMA schedule
14 Data sent The number of data packets sent from a sensor to its CH
15 Data Received The number of data packets received from CH
16 Data sent to BS The number of data packets sent to the BS
17 Distance CH to BS The distance between the CH and the BS
18 Send Code The cluster sending code
19 Attack Type Type of the node. It is a class of five possible values, namely, Blackhole, Grayhole, Flooding, and Scheduling, in addition to normal, if the node is not an attacker
Fig. 3. Machine Learning.
The most common and widespread uses of these Bayes algorithms are spam filtering, text analysis, and medical analysis. As these classifiers are easy to implement, they are among the most commonly used in machine learning. As stated by [14], Naïve Bayes classification uses probability theory to classify the data and makes use of Bayes’ theorem in its algorithm. The main feature of this classifier is that the probability of an event can be adjusted as new data is introduced. It also assumes that all the attributes under consideration are independent of each other. A Naïve Bayes classifier is not a single algorithm but a family of machine learning algorithms that share this statistical independence assumption. A Naïve Bayes classifier makes the correct classification decision as long as the correct class is more probable than any other class. This holds even though the probability estimates are slightly inaccurate most of the time [3].
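The independence assumption described above can be illustrated with a minimal sketch of a categorical Naïve Bayes classifier. This is not the WEKA implementation used in the experiments; it assumes categorical attributes and Laplace (add-one) smoothing, and the toy records below (two attributes loosely modeled on protocol_type and flag from Table I) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    # distinct values seen per attribute, needed for Laplace smoothing
    domains = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict_nb(model, row):
    """Return the class with the highest (log) posterior score for one row."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        # log P(class) + sum over attributes of log P(value | class);
        # the plain sum is exactly the "naive" independence assumption.
        score = math.log(n_label / total)
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            score += math.log((count + 1) / (n_label + len(domains[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records: (protocol_type, flag)
rows = [("tcp", "SF"), ("tcp", "SF"), ("udp", "SF"),
        ("tcp", "S0"), ("tcp", "S0"), ("icmp", "S0")]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

model = train_nb(rows, labels)
prediction = predict_nb(model, ("tcp", "S0"))
```

Only the ranking of the class scores matters for the decision, which is why the classifier can still pick the right class when the individual probability estimates are somewhat off.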
2) Fine Tune Naïve Bayes (FTNB): In Naïve Bayes classification, the tuning of parameters is limited, so it is recommended to improve the quality of the pre-processing and feature selection processes. Classifier performance and prediction can be improved by tuning and adjusting the classifier parameters, by applying classifier combination techniques, or by managing the data fed to the classifier, whether by adding more data or by refining the existing data [3].
3) Data Parsing (pre-processing): According to [15], the data is a string of raw text presented for each data point. A series of processing steps converts this data into a structured vector in which each offset represents one feature and the value at that offset corresponds to its frequency. Stemming, synonym finding, and the handling of neutral words in the raw text are some of the ways to improve the data parsing or data processing methods.
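As a rough illustration of the conversion from raw text to a frequency vector, the sketch below builds a vocabulary and counts token occurrences. The tokenization rule and the two example strings are assumptions made for illustration; stemming and synonym handling are left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Map each distinct token to a fixed offset in the feature vector."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: offset for offset, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Return a frequency vector: offset = feature, value = token count."""
    counts = Counter(tokenize(text))
    # The vocabulary dict was built in offset order, so iterating its keys
    # visits the offsets 0, 1, 2, ... in sequence.
    return [counts[tok] for tok in vocabulary]


# Hypothetical raw text for two data points
documents = ["free offer free prize", "meeting agenda for monday"]
vocabulary = build_vocabulary(documents)
vector = vectorize("free prize free", vocabulary)
```

Here the offset of "free" in the vector holds the value 2 because the word occurs twice in the parsed text.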
a) Selection of Features: According to [16], use cases for Naïve Bayes classification, such as spam filtering, are studied by observing how they fail and how they can quickly be improved. For example, an above-average spam filter has features such as the frequency of words in all caps, the words in titles, or the occurrence of exclamation marks in the title. The best feature for improvement is the use of long words or groups of more than one word.
4) IBK algorithm: The Instance-Based Learner (IBK) algorithm uses a distance measure to classify instances based on their K nearest neighbors [17]. The computation in the test phase is very expensive and takes a long time, especially for a huge number of instances in the dataset. The default number of neighbors is 1, in which case the algorithm is sometimes called 1-NN [18].
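The nearest-neighbor idea can be sketched in a few lines. This is an illustrative version, not WEKA's IBK code; it assumes numeric, already-normalized features and Euclidean distance, and the four training records are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, query, k=1):
    """Label `query` by majority vote among its k nearest training rows."""
    # Nothing is learned in advance: every prediction scans the whole
    # training set, which is why the test phase is the expensive part.
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature records
train_rows = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_labels = ["normal", "normal", "attack", "attack"]

nearest = knn_predict(train_rows, train_labels, (0.85, 0.75))        # k=1, i.e. 1-NN
majority = knn_predict(train_rows, train_labels, (0.85, 0.75), k=3)
```

With k=1 the query simply inherits the label of its single closest training record; larger values of k trade some sensitivity for robustness to noisy neighbors.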
5) Random forest algorithm: The Random Forest, or random decision forest, algorithm is an ensemble method used for classification and regression. In the WEKA program, Random Forest can only perform the classification part, not the regression task. It operates by building a large number of decision trees in the training phase and then performing the classification task. In WEKA, there is no output of the mean prediction or regression of each tree. Random Forest classification means mapping the input instances in the dataset to a category. This is also called categorization of the instances. The algorithm that performs the classification, especially in a concrete implementation, is called the classifier [19].
IV. EXPERIMENTS AND RESULT
This section discusses and demonstrates the experiments conducted and their results. Both datasets were classified using the above-mentioned machine learning algorithms (Section III-B) with the cross-validation and percentage split techniques. Cross-validation is a standard analysis tool used to verify the validity of a data mining model. It works by dividing the dataset into a number of folds, holding out each fold in turn for testing, and training on all of the other folds. When dividing the dataset into folds, it ensures that each fold has the correct proportion of class values [20]. Percentage split, in contrast, determines the percentage of the data used for training the system [20]. For our experiments, 66% was used for training and 34% was used for testing. The following subsections present the results obtained by each algorithm on both datasets using the cross-validation and percentage split techniques.
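The two test options can be sketched as index-splitting routines. This is a simplified illustration rather than WEKA's procedure: the 66% training fraction comes from the text, while the number of folds (10), the random seed, and the 80/20 toy label distribution are assumptions.

```python
import random
from collections import defaultdict


def percentage_split(n_instances, train_fraction=0.66, seed=1):
    """Shuffle instance indices and cut them into train and test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_fraction)
    return indices[:cut], indices[cut:]


def stratified_folds(labels, n_folds=10, seed=1):
    """Assign indices to folds so each fold keeps roughly the class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the members of each class round-robin across the folds.
        for index in members:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, n_folds=10, seed=1):
    """Yield (train, test) index lists, holding out one fold at a time."""
    folds = stratified_folds(labels, n_folds, seed)
    for held_out in range(n_folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Hypothetical label column: 80 normal records and 20 attack records
labels = ["normal"] * 80 + ["attack"] * 20

train_idx, test_idx = percentage_split(len(labels))   # 66 / 34 instances
splits = list(cross_validation_splits(labels))        # 10 (train, test) pairs
```

In cross-validation every instance is tested exactly once, whereas the percentage split evaluates on a single held-out 34% of the data.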
A. Naïve Bayes (NB) Algorithm
1) Cross-validation technique: Table III shows the results of running the NB algorithm on both datasets (KDDCUP99 and WSN-DS) using the cross-validation technique. Table IV shows the weighted average accuracy using the cross-validation technique in terms of several factors such as:
True Positive Rate (TP): the rate at which the system or algorithm correctly classifies an instance as the positive class.
True Negative Rate (TN): the rate at which the system or algorithm correctly classifies an instance as the negative class.
False Positive Rate (FP): the rate at which the system or algorithm falsely (wrongly) classifies an instance as the positive class.
False Negative Rate (FN): the rate at which the system or algorithm falsely (wrongly) classifies an instance as the negative class.
Precision: the ratio of instances correctly classified as positive to all instances classified by the algorithm as positive.
Recall: the ratio of instances correctly classified as positive to all positive instances (whether classified correctly or not).
Receiver Operating Characteristics (ROC): a graph or curve used to represent or visualize the performance of classifiers. It is widely used in machine learning, data mining, and decision making. It is also used as a method of comparing diagnostic tests.
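The definitions above can be made concrete with a short sketch that computes the per-class rates one-vs-rest and then averages them weighted by class size. The weighting scheme is an assumption about how the "weighted average" rows are formed, the ten toy labels are hypothetical, and undefined ratios are simply reported as 0.0 here. The final line recomputes the accuracy percentage from the instance counts reported in Table III.

```python
from collections import Counter


def per_class_rates(actual, predicted, positive):
    """Treat `positive` as the positive class and every other class as negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "tp_rate": tp / (tp + fn) if tp + fn else 0.0,    # identical to recall
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def weighted_average(actual, predicted, metric):
    """Average a per-class metric, weighting each class by its instance count."""
    support = Counter(actual)
    total = len(actual)
    return sum(
        per_class_rates(actual, predicted, cls)[metric] * n / total
        for cls, n in support.items()
    )


# Hypothetical ground truth and classifier output for ten instances
actual = ["normal"] * 6 + ["attack"] * 4
predicted = ["normal"] * 5 + ["attack"] + ["attack"] * 3 + ["normal"]

attack_rates = per_class_rates(actual, predicted, "attack")
weighted_tp = weighted_average(actual, predicted, "tp_rate")

# Accuracy from the counts in Table III: correctly classified / total instances
accuracy_pct = 100 * 459019 / 494020
```

When the weights are the class sizes, the weighted TP rate equals the overall fraction of correctly classified instances, which is why the two figures agree in the result tables (for example 0.929 and 92.9151 % for Naïve Bayes on KDDCUP99).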
2) Percentage split technique: In this experiment, 66% of the data was used for training and 34% for testing. Table V shows the results of running the Naïve Bayes (NB) algorithm on both datasets (KDDCUP99 and WSN-DS) using the percentage split technique. In addition, Table VI shows the weighted accuracy average using the percentage split technique in terms of TP rate, FP rate, precision, and ROC area.
B. IBK Algorithm
1) Cross-validation technique: Table VII shows the results of running the IBK algorithm on both datasets (KDDCUP99 and WSN-DS) using the cross-validation technique. Table VIII shows the weighted accuracy average using the cross-validation technique.
2) Percentage split technique: In this experiment, 66% of
the data was used for training and 34% for testing. Table IX
shows the results of running IBK algorithm on both datasets
(KDDCUP99 and WSN-DS) using the percentage split
technique. Table X demonstrates the weighted accuracy
average using the percentage split technique in terms of
several factors.
C. Random Forest Algorithm
1) Cross-validation technique: Table XI shows the results
of running the Random Forest algorithm on both datasets
(KDDCUP99 and WSN-DS) using the cross-validation
technique. Table XII demonstrates the weighted accuracy
average using the cross-validation technique.
2) Percentage split technique: In this experiment, 66% of the data was used for training and 34% for testing. Table XIII shows the results of running the Random Forest algorithm on both datasets (KDDCUP99 and WSN-DS) using the percentage split technique. Table XIV shows the weighted accuracy average using the percentage split technique in terms of several factors.
TABLE III. THE RESULTS OF NAÏVE BAYES (NB) ALGORITHM USING THE CROSS-VALIDATION TECHNIQUE
Dataset
KDDCUP99 WSN-DS
Correctly Classified Instances 459019 92.9151 % 459019
Incorrectly Classified Instances 35001 7.0849 % 35001
Kappa statistic 0.8828 0.8828
Mean absolute error 0.0061 0.0061
Root mean squared error 0.0765 0.0765
Relative absolute error 11.955 % 11.955 %
Root relative squared error 47.6941 % 47.6941 %
Total Number of Instances 494020 494020
TABLE IV. THE WEIGHTED ACCURACY AVERAGE OF NAÏVE BAYES (NB) ALGORITHM USING THE CROSS-VALIDATION TECHNIQUE
Weighted Avg. of TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
KDDCup99 0.929 0.000 0.989 0.929 0.951 0.948 1.000 0.991
WSN Dataset 0.954 0.012 0.966 0.954 0.957 0.847 0.980 0.971
TABLE V. THE RESULTS OF NAÏVE BAYES (NB) ALGORITHM USING THE PERCENTAGE SPLIT TECHNIQUE
Dataset
WSN KDDCUP99
Correctly Classified Instances 121606 95.4634% 121606
Incorrectly Classified Instances 5779 4.5366 % 5779
Kappa statistic 0.7678 0.7678
Mean absolute error 0.0182 0.0182
Root mean squared error 0.1324 0.1324
Relative absolute error 26.2165 % 26.2165 %
Root relative squared error 71.0237 % 71.0237 %
Total Number of Instances 127385 127385
TABLE VI. THE WEIGHTED ACCURACY AVERAGE OF NAÏVE BAYES (NB) ALGORITHM USING THE PERCENTAGE SPLIT TECHNIQUE
Weighted Avg. of TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
KDDCup99 0.930 0.000 NA 0.930 NA NA 1.000 0.991
WSN Dataset 0.967 0.955 0.958 0.981 0.972
TABLE VII. THE RESULTS OF IBK ALGORITHM USING THE CROSS-VALIDATION TECHNIQUE
Dataset
KDDCUP99 WSN
Correctly Classified Instances 493796 99.9547 % 493796
Incorrectly Classified Instances 224 0.0453 % 224
Kappa statistic 0.9992 0.9992
Mean absolute error 0 0
Root mean squared error 0.0063 0.0063
Relative absolute error 0.0791 % 0.0791 %
Root relative squared error 3.9104 % 3.9104 %
Total Number of Instances 494020 494020
TABLE VIII. THE WEIGHTED ACCURACY AVERAGE OF IBK ALGORITHM USING THE CROSS-VALIDATION TECHNIQUE
Weighted Avg. of TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
KDDCup99 1.000 0.000 1.000 1.000 1.000 0.999 1.000 0.999
WSN Dataset 0.994 0.025 0.994 0.994 0.994 0.970 0.985 0.992
TABLE IX. THE RESULTS OF IBK ALGORITHM USING THE PERCENTAGE SPLIT TECHNIQUE
Dataset
KDDCUP99 WSN
Correctly Classified Instances 167869 99.9417 % 167869
Incorrectly Classified Instances 98 0.0583 % 98
Kappa statistic 0.999 0.999
Mean absolute error 0.0001 0.0001
Root mean squared error 0.0071 0.0071
Relative absolute error 0.1024 % 0.1024 %
Root relative squared error 4.4419 % 4.4419 %
Total Number of Instances 167967 167967
TABLE X. THE WEIGHTED ACCURACY AVERAGE OF IBK ALGORITHM USING THE PERCENTAGE SPLIT TECHNIQUE
Weighted Avg. of TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
KDDCup99 1.000 0.000 1.000 1.000 1.000 0.999 1.000 0.999
WSN Dataset 0.994 0.025 0.994 0.994 0.994 0.970 0.985 0.992
TABLE XI. THE RESULTS OF RANDOM FOREST ALGORITHM USING THE CROSS-VALIDATION TECHNIQUE
Dataset
KDDCUP99 WSN
Correctly Classified Instances 493915 99.9787 % 167869
Incorrectly Classified Instances 105 0.0213 % 98
Kappa statistic 0.9996 0.999
Mean absolute error 0.0001 0.0001
Root mean squared error 0.004 0.0071
Relative absolute error 0.1064 % 0.1024 %
Root relative squared error 2.5242 % 4.4419 %
Total Number of Instances 494020 167967
TABLE XII. THE WEIGHTED ACCURACY AVERAGE OF RANDOM FOREST ALGORITHM USING THE CROSS-VALIDATION TECHNIQUE
Weighted Avg. of TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
KDDCup99 1.000 0.000 NA 1.000 NA NA 1.000 1.000
WSN Dataset 0.997 0.016 0.997 0.997 0.997 0.985 0.997 0.999
TABLE XIII. THE RESULTS OF RANDOM FOREST ALGORITHM USING THE PERCENTAGE SPLIT TECHNIQUE
Dataset
KDDCUP99 WSN
Correctly Classified Instances 167915 99.969 % 167915
Incorrectly Classified Instances 52 0.031 % 52
Kappa statistic 0.9995 0.9995
Mean absolute error 0.0001 0.0001
Root mean squared error 0.0046 0.0046
Relative absolute error 0.1225 % 0.1225 %
Root relative squared error 2.8772 % 2.8772 %
Total Number of Instances 167967 167967
TABLE XIV. THE WEIGHTED ACCURACY AVERAGE OF RANDOM FOREST ALGORITHM USING THE PERCENTAGE SPLIT TECHNIQUE
Weighted Avg. of TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
KDDCup99 1.000 0.000 NA 1.000 NA NA 1.000 1.000
WSN Dataset 0.997 0.015 0.997 0.997 0.997 0.985 0.997 0.999
V. DISCUSSION AND PERFORMANCE EVALUATION
In this section, all results are discussed and analyzed, and an overall performance evaluation is presented. The Naïve Bayes algorithm, using the cross-validation technique, classified most of the instances correctly on both datasets. The TP rate is about 92.9% on KDDCup99 and 95.3% on WSN-DS. The incorrectly classified instances were 7.08% and 4.064%, respectively. Therefore, the Naïve Bayes algorithm is more accurate on the WSN dataset than on the KDDCup99 dataset. Moreover, the weighted accuracy averages of both datasets are very similar. Using the percentage split technique with the same algorithm on both datasets showed more accurate results than cross-validation.
Moreover, the IBK algorithm was run on both datasets using cross-validation. Both runs took less than one second. As can be seen from the correctly classified instances, the results for both datasets were very close even though the number of instances in each dataset is not the same. The TP rate is about 100% on KDDCup99-DS and 99.4% on WSN-DS. The incorrectly classified instances were 0.552% in WSN-DS and 0.0453% in KDDCup99-DS, whereas the correctly classified instances were 99.4% in WSN-DS and 99.9% in KDDCup99-DS, which is excellent accuracy on both datasets, almost 100%. This is also reflected in the weighted averages of both datasets for the IBK algorithm: the numbers are very similar, almost 100%.
With the percentage split, the IBK algorithm was also very
accurate on the KDDCup99 and WSN datasets. The
incorrectly classified instances were 0.584% in WSN-DS and
0.058% in KDDCup99-DS, and the correctly classified
instances were 99.4% in WSN-DS and 99.9% in KDDCup99-DS,
which is an excellent accuracy on both datasets, almost
100%. To sum up, IBK with the percentage split test option
is very accurate on both the KDDCup99 dataset and the WSN
dataset, with results comparable to those of cross-validation.
Furthermore, the Random Forest algorithm was run on
both datasets using the cross-validation and percentage split
options. With cross-validation, the TP rate is about 100% on
KDDCup99-DS and 99.7% on WSN-DS. The
incorrectly classified instances were 0.2779% in WSN-DS and
0.0213% in KDDCup99-DS. Also, the correctly classified
instances were 99.7% in WSN-DS and 99.9% in KDDCup99-DS,
which is an excellent accuracy on both datasets, almost
100%. For the percentage split, both datasets took a few
seconds to process (6.24 and 8.45 seconds, respectively). The TP rate
is about 100% on KDDCup99-DS and 99.7% on WSN-DS.
The incorrectly classified instances were 0.2724% in
WSN-DS and 0.031% in KDDCup99-DS. The correctly
classified instances were 99.7% in WSN-DS and 99.9% in
KDDCup99-DS, which is an excellent accuracy on both datasets,
almost 100%. It can be concluded that Random Forest with the
percentage split test option is very accurate on both the
KDDCup99 dataset and the WSN dataset.
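The percentages quoted above follow directly from the instance counts. As a worked check, the short sketch below recomputes the KDDCup99 percentage split figures for Random Forest from the counts listed in Table XIII (167915 correctly and 52 incorrectly classified instances). The variable names are illustrative only.

```python
# Counts for KDDCup99, Random Forest, percentage split (Table XIII).
correct = 167915
incorrect = 52
total = correct + incorrect

accuracy_pct = 100 * correct / total   # share of correctly classified instances
error_pct = 100 * incorrect / total    # share of incorrectly classified instances

print(f"total = {total}")
print(f"correct = {accuracy_pct:.3f} %, incorrect = {error_pct:.3f} %")
```

The result reproduces the 99.969 % and 0.031 % reported in the table, and the total matches the 167967 test instances.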
As an overall performance evaluation among all algorithms
and test options for the KDDCup99 dataset, the Naïve Bayes
algorithm with the cross-validation test option is the least accurate
(92.92%), meaning it has the fewest correctly classified
instances. On the other hand, the Random Forest algorithm
with the cross-validation test option (99.98%) was the most
accurate. Similarly, for the WSN dataset, the Naïve Bayes
algorithm with the cross-validation test option is the least accurate
(95.35%), meaning it has the fewest correctly classified
instances, and the Random Forest algorithm with the percentage
split test option is the most accurate one (99.73%).
Moreover, the accuracy and processing time were recorded
for both datasets using all test options, as shown in Fig. 4, 5, 6
and 7. The shortest processing time was obtained with the IBK
algorithm using the percentage split test option on the WSN dataset
(0.05 seconds), followed by the same algorithm and test option on
the KDDCup99 dataset (0.08 seconds). As for accuracy,
the Random Forest algorithm is the most accurate algorithm on
both datasets with all test options. The highest accuracy was
registered using cross-validation on the KDDCup99 dataset
(99.9787 %), then on the WSN dataset (99.7276 %) using the
percentage split test option, as shown in Fig. 4 and Fig. 6,
respectively.
Fig. 4. Comparison of Accuracy on KDDCup99 Dataset.
Fig. 5. Comparison of the Processing Time on KDDCup99 Dataset.
Fig. 6. Comparison of Accuracy on WSN Dataset.
Fig. 7. Comparison of the Processing Time on WSN Dataset.
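The accuracy comparison discussed in this section can be reproduced from the error percentages quoted in the text. The sketch below is illustrative only. It includes just the IBK and Random Forest error rates stated in this section, since the Naïve Bayes percentage split figures are not quoted here. The dictionary layout and the helper name `best` are assumptions.

```python
# Incorrectly classified instances (%) as quoted in Section V.
error_pct = {
    ("KDDCup99", "IBK", "cross-validation"): 0.0453,
    ("KDDCup99", "IBK", "percentage split"): 0.058,
    ("KDDCup99", "Random Forest", "cross-validation"): 0.0213,
    ("KDDCup99", "Random Forest", "percentage split"): 0.031,
    ("WSN-DS", "IBK", "cross-validation"): 0.552,
    ("WSN-DS", "IBK", "percentage split"): 0.584,
    ("WSN-DS", "Random Forest", "cross-validation"): 0.2779,
    ("WSN-DS", "Random Forest", "percentage split"): 0.2724,
}

# Accuracy is the complement of the error rate.
accuracy = {key: 100 - err for key, err in error_pct.items()}


def best(dataset):
    """Return the (dataset, algorithm, test option) key with the highest accuracy."""
    candidates = {k: a for k, a in accuracy.items() if k[0] == dataset}
    return max(candidates, key=candidates.get)


for name in ("KDDCup99", "WSN-DS"):
    key = best(name)
    print(name, key[1], key[2], f"{accuracy[key]:.4f} %")
```

With these inputs, the selection agrees with the highest accuracies reported above: Random Forest with cross-validation on KDDCup99 and Random Forest with percentage split on the WSN dataset.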
VI. CONCLUSION
Due to the importance of protecting WSNs against rogue
entities such as hackers and intruders, and taking into consideration
constraints such as limited power, storage, and processing
capabilities, a model needs to be trained on suitable datasets to
mitigate new or modified attack types in networks.
This paper has analyzed and compared different machine
learning algorithms on two datasets (WSN and KDD99)
using the WEKA tool. The purpose was to further assist in
analyzing and protecting WSN networks in combination with
Firewalls, Deep Packet Inspection (DPI), and Intrusion
Prevention Systems (IPS) that are specialized in protecting
WSN networks. Multiple testing options were investigated,
such as cross-validation and percentage split. Based on the
findings, the most accurate algorithm and the least
time-consuming algorithm were suggested for both datasets. Future
research is needed to create more datasets to characterize various
types of attacks in wireless sensor networks.
REFERENCES
[1] P. Kurer, D. M, and H. S. Guruprasad, “Energy Aware Dynamic
Clustering and Hierarchical Route based on LEACH for WSN,”
International Journal of Computer Networking Wireless and Mobile
Communications, vol. 3, no. 3, pp. 79-86, 2013.
[2] A. I. Al-issa, M. Al-Akhras, M. ALsahli, and M. Alawairdhi, “Using
Machine Learning to Detect DoS Attacks in Wireless Sensor Networks,”
Paper presented at the 2019 IEEE Jordan International Joint Conference
on Electrical Engineering and Information Technology (JEEIT), April
2019.
[3] K. Hindi, “Fine tuning the Naïve Bayesian learning algorithm,” AI
Communications, vol. 27, no. 2, pp. 133-141, 2014. doi: 10.3233/AIC-
130588.
[4] M. A. Alsheikh, S. Lin, D. Niyato, and H. Tan, “Machine Learning in
Wireless Sensor Networks: Algorithms, Strategies, and Applications,”
IEEE Communications Surveys & Tutorials, vol. 16, no. 4, pp. 1996-
2018, 2014. doi: 10.1109/COMST.2014.2320099.
[5] B. Ashwini, S. Abhale, and S. Manivannan, “Supervised Machine
Learning Classification Algorithmic Approach for Finding Anomaly
Type of Intrusion Detection in Wireless Sensor Network,” Optical
Memory and Neural Networks, vol. 29, no. 3, pp. 244-256, 2020.
Available: https://link.springer.com/article/10.3103/S1060992X200300
29#citeas.
[6] S. Gunduz, B. Arslan, and M. Demirci, “A Review of Machine Learning
Solutions to Denial-of-Services Attacks in Wireless Sensor Networks,”
2015 IEEE 14th International Conference on Machine Learning and
Applications (ICMLA), 2015. Available: https://ieeexplore.ieee.org/
document/7424301/authors#authors.
[7] A. Amouri, V. T. Alaparthy, and S. D. Morgera, “A Machine Learning
Based Intrusion Detection System for Mobile Internet of Things,”
Sensors (Basel), vol. 20, no. 2, 2020. [Online]. Available:
https://pubmed.ncbi.nlm.nih.gov/31947567/.
[8] A. M. Al Tobi and I. Duncan, “KDD 1999 generation faults: a review
and analysis,” Journal of Cyber Security Technology, vol. 2, no. 3-4, pp.
164-200, 2018. doi: 10.1080/23742917.2018.1518061.
[9] E. Kabir, J. Hu, H. Wang, and G. Zhuo, “A novel statistical technique
for intrusion detection systems,” Future Generation Computer Systems,
vol. 79, no. 1, pp. 303-318, 2018. doi: https://doi.org/10.1016/j.future.
2017.01.029.
[10] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed
analysis of the KDD CUP 99 data set,” Paper presented at the 2009
IEEE Symposium on Computational Intelligence for Security and
Defense Applications, 2009.
[11] C. Elkan, “Results of the KDD'99 classifier learning,” SIGKDD Explor.
Newsl., vol. 1, no. 2, pp. 63-64, 2000. doi: 10.1145/846183.846199.
[12] I. Almomani, B. Al-Kasasbeh, and M. Al-Akhras, “WSN-DS: A Dataset
for Intrusion Detection Systems in Wireless Sensor Networks,” Journal
of Sensors, 2016. doi: 10.1155/2016/4731953.
[13] I. Almomani and B. Al-Kasasbeh, “Performance analysis of LEACH
protocol under Denial of Service attacks,” Paper presented at the 2015
6th International Conference on Information and Communication
Systems (ICICS), 2015.
[14] S. L. Ting, W. H. Ip, and A. Tsang, “Is Naïve Bayes a Good Classifier
for Document Classification?” International Journal of Software
Engineering and its Applications, vol. 5, no. 3, pp. 37-46, 2011.
[15] D. Meretakis and B. Wüthrich, “Extending naïve Bayes classifiers using
long itemsets,” Paper presented at the KDD '99, 1999.
[16] I. Androutsopoulos, J. Koutsias, K. Chandrinos, and C. Spyropoulos,
“An Experimental Comparison of Naive Bayesian and Keyword-Based
Anti-Spam Filtering with Personal E-mail Messages,” Proceedings of
the 23rd annual international ACM SIGIR conference on Research and
development, 2000. doi: 10.1145/345508.345569.
[17] K. El-Hindi and M. Al-Akhras, “Smoothing Decision Boundaries to
Avoid Overfitting in Neural Network Training,” Neural Network World,
vol. 21, no. 4, pp. 311-325, 2011.
[18] I. I. Baskin, G. Marcou, D. Horvath, and A. Varnek, “Classification
Models,” Tutorials in Chemoinformatics, Varnek, A. (Ed.), 2017.
[19] F. Syeda, M. A. B. Mirza, A. Baig, and M. Pawar, “Performance
Evaluation of Different Data Mining Classification Algorithm and
Predictive Analysis,” IOSR Journal of Computer Engineering, vol. 10,
no. 6, pp. 1-6, 2013. doi: 10.9790/0661-1060106.
[20] T. Borovicka, M. Jirina, and P. Kordík, “Selecting Representative Data
Sets,” Advances in data mining knowledge discovery and applications,
pp. 43-70, 2012.
626 | Page
www.ijacsa.thesai.org
... DT [66] is a widely used supervised learning method for the purpose of predicting the categorical target variable. The goal of DT is to predict the variable by making a tree of decision rules from the input features. ...
... KNN [66], is another supervised learning method that works completely different from the probability-based methods such as LR and NB. KNN groups the labels or target variables based on the similarity of using the simple distance-based formula. ...
Thesis
Due to its wide range of applications, the Internet of Things (IoT) technology is evolving rapidly. One can witness IoT systems in smart cities, smart homes,smart healthcare, smart industry, and smart agriculture. IoT systems usually use low-powered and low-memory devices to sense the data from the environment and transmit it to the destination through wired or wireless communication channels. Although IoT technology is gaining massive attention in every sector of life, the security of these devices is one of the biggest challenges. Due to resource constraints, these devices are often vulnerable to malicious actors. In this work, a machine learning-based intelligent classification of the IoT network attacks using real-time heterogeneous data is carried out. Two IoT network malware datasets (Ton-IoT & IoT-23) that include the real-time IoT Botnet attacks are used for the experiments. The data is pre-processed before performing the experimentation. In addition, an information gain-based feature selection method is also applied to select the most important features in the dataset. Several classification methods include Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Naïve Bayes (NB), and eXtreme Gradient Boosting (XGB) are implemented. These models were evaluated using classification metrics; accuracy, precision, recall, and f1-score. It is concluded that the Naïve Bayes and Logistic Regression are not the best methods to perform classification on these datasets. On the other hand, DT, RF, KNN, and XGB provided an accuracy of 99% for binary labels and 98% for multiclass labels for the Ton-IoT dataset. Using the IoT-23 dataset, these models provided accuracy above 90%. It is found that LR and NB are not the best choices for classification using either dataset. 
In addition, not all the features in these datasets are essential; hence some can be dropped to reduce the complexity of the model and improve the computational capacity. It is further concluded that heterogeneity in the dataset does not necessarily affect the performance of classification algorithms
... Their comparative analysis included RF, NB, and Multi-Layer Perceptron (MLP), with RF yielding optimal results with 99.5% precision. Likewise, (Alsahli et al., 2021) evaluated machine learning algorithms, including RF and K-nearest neighbours (KNN), for identifying attacks on the UNSW-NB15 benchmark dataset. RF and KNN classifiers achieved exceptional accuracy rates of 99%, while NB demonstrated inferior performance. ...
... Lack of availability of sufficient security-related data in IoT networks. (Alsahli et al., 2021) UNSW-NB15 Classifications used are RF, KNN, and NB. The study results shows that RF and KNN classifiers give the best performance with an accuracy of 100% (without noise injection) and 99% (with 10% noise filtering), while the NB classifier gives the worst performance with an accuracy of 95.35%. ...
Article
Full-text available
Internet of Things (IoT) technology has evolved significantly, transitioning from personal devices to powering smart cities and global deployments across diverse industries. However, security challenges arise due to diverse devices using various protocols and having limited computational capabilities, leading to vulnerabilities and potential intrusions in IoT networks. This paper addresses the challenge of intrusion detection in IoT by introducing a heterogeneous machine learning-based stack classifier model for IoT data. The model employs feature selection and ensemble modelling to investigate and enhance key classification metrics for intrusion detection of IoT data. This approach comprises two core components: the utilization of the K-Best algorithm for feature selection, extracting the top 15 critical features and the construction of an ensemble model incorporating various traditional machine learning models. The integration of these components harnesses information from selected features and leverages the collective strength of individual models to enhance classification performance. Using the 'Ton IoT dataset,' our experiments compare the ensemble model with individual ones. This research aims to improve key classification metrics for IoT intrusion detection , focusing on accuracy, precision, recall and F1 score. Through rigorous experimentation and comparisons, the proposed ensemble approach showcases exceptional performance, providing a robust solution to fortify IoT network security. ARTICLE HISTORY
... Research by (Almomani et al., 2021), has conducted a comparative evaluation of ten machine learning classification algorithms, and the results show that random forest outperforms other classification algorithms in terms of accuracy. The random forest algorithm also has the advantage of carrying out a cross-validation test with the most accurate results (Alsahli, Almasri, Al-Akhras, Al-Issa, & Alawairdhi, 2021). ...
Article
Full-text available
As the volume and complexity of computer network traffic continue to increase, network administrators face a growing challenge in monitoring and discovering unusual activity. To keep the network safe and functioning, detecting anomalies is essential. Machine learning-based anomaly detection techniques have become increasingly popular in recent years. This is due to the fact that conventional anomaly detection methods make it difficult to detect unknown and complex attacks. This research aims to conduct a performance analysis of two feature selection methods using the random forest algorithm using the UNSW-NB15 dataset to determine which model is most effective in detecting network traffic anomalies. The models evaluated were random forest with the filter method and random forest with the wrapper method. A number of metrics used for model performance assessment are accuracy, F1-score, receiver operating characteristic curve, and precision-recall. Dataset collection, data pre-processing, feature selection, model construction, and evaluation are the main components of the research methodology. The research results show that the Random Forest approach with the Filter method has an accuracy of 0.8950, F1-score of 0.8333, ROC score of 0.8928, and a precision-recall value of 0.8347. Meanwhile, the approach using the Wrapper method obtained an accuracy of 0.9151, F1-score of 0.8510, ROC score of 0.9136, and a precision-recall value of 0.8637. This shows that the performance of Random Forest with the Wrapper method is superior in all assessment metrics. Random Forest with the Wrapper Method is the right choice of model for detecting network traffic anomalies because of its stable performance and ability to handle complex patterns
... To provide dynamic adaptive intrusion monitoring in IoT, machine learning has been widely used in various solutions [3,4]. Although machine learning-based solutions have shown a great ability for dynamic and accurate attack intrusions in IoT, some concerns have been raised regarding data privacy in the learning process. ...
... F. Evaluation: Several significant measurement methods are utilized for the assessment of our artificial intelligence algorithms. In our investigation, the following algorithms were used including a M5P and MLP [16], Bagging(REPTree), Additive regression and Stacking [17], Random forest and decision table [18]. ...
Article
Ensuring safe drinking water is a vital worldwide task. Accurate water quality prediction is crucial for protecting public health and the environment. Machine learning provides promising solutions for this objective. The study investigates the issue of precisely forecasting water quality with machine learning models. It examines different models and their efficacy in forecasting water quality features utilizing a given dataset. We conducted a comprehensive analysis of multiple machine learning models, including Bagging(REPTree), Multilayer Perceptron, M5P, Additive Regression, Stacking, Random Forest, and Decision Table. Firstly, ourselves imported the dataset into Weka, selected and configured the models, trained them on the dataset, and evaluated their performance using various metrics. Bagging (REPTree) outperformed compared to other models, showing its effectiveness in predicting water quality. Model selection depends on goals and constraints. Future research opportunities include feature engineering, ensemble methods, and data quality issues. The study concludes that Bagging (REPTree) classifier is a strong candidate for properly predicting water quality attributes. Future research should focus on improving feature engineering, exploring ensemble methods, expanding the dataset, and enhancing model explain ability. Deploying selected models for continuous monitoring and early detection can contribute to safer water supplies and sustainable water management practices. Compliance with water quality regulations can be better ensured through the application of these models. Overall, this study offers valuable insights regarding the application of machine learning for water quality prediction and highlights future directions for research and application in this important area.
... To detect anomalies in the network, machine learning classifiers can be used [5]. Machine learning classifiers fall into three categories supervised learning, unsupervised learning and semi-supervised learning [6]. ...
Article
Full-text available
The primary objective of an intrusion detection system (IDS) is to monitor the network performance and to look into any indications of malformation over the network. While providing high-security network IDS played a vital role for the past couple of years. IDS will fail to identify all types of attacks, when it comes to anomaly detection, it is often connected with a high false alarm rate with accuracy and the detection rate is very average. Recently, IDS utilize machine learning methods, because of the way that machine learning algorithms demonstrated to have the capacity of learning and adjusting as well as permitting a proper reaction for real-time data. This work proposes a prediction-level fusion model for intrusion detection and classification using machine learning techniques. This work also proposes retraining of model for unknown attacks to increase the effectiveness of classification in IDS. The experiments are carried out on the network security layer knowledge discovery in database (NSL-KDD) dataset using the Konstanz information miner (KNIME) analytics platform. The experimental results showed a classification accuracy of 90.03% for a simple model to 96.31% for fusion and re-trained models. This result inspires the researchers to use machine learning techniques with a fusion model to build IDS.
Conference Paper
With the development of heterogeneous sensory networks in various applications of domains for ensuring the security against denial based service (DoS) attacks will become a paramount. In this research study, a proposal of a novel system named SecureSenseGuard(SSG) for effectively detecting DoS attacks in heterogeneous sensor networks. SecureSenseGuard(SSG) leveraging the Decision Tree (DT) algorithm augmented with Recursive type Features Elimination (RFE) for achievinga robust and efficient attack detection. The proposed system may operate by first Collection of data from diverged sensors within the network. The data, representing various network attributes, are then preprocessed and fed into the DT algorithm for classifications process. RFE will be employed as a feature selection techniques to enhance the performance of DT algorithm by iterative process for removing less relevant features, thus improving in detection accuracy and reducing computational overhead. The proposed system SecureSenseGuard(SSG) will offer several advantages over existing approaches. By employing DT, it will provide interpretable and transparent decision rules for facilitating understanding and trust in the detection process. Moreover, the integration of RFE will enhance the system's scalability and adaptability for dynamic network environments. To evaluate the effectiveness of SecureSenseGuard(SSG) an extensive experiment may be conducted by using various ranges of datasets and synthetic attack scenarios. The Results demonstrate that SecureSenseGuard(SSG) achieved high detection rates. while maintaining a low false alarm rate, outperforming traditional approaches. Overall, SecureSenseGuard(SSG) may present a promising solution in mitigating DoS attacks in heterogeneous sensory networks by offering a balance between detection accuracy, interpretability, and computational efficiency.
Article
Full-text available
Named Data Networking (NDN) is gaining a significant attention in Vehicular Ad-hoc Networks (VANET) due to its in-network content caching, name-based routing, and mobility-supporting characteristics. Nevertheless, existing NDN faces three significant challenges, including security, privacy, and routing. In particular, security attacks, such as Content Poisoning Attacks (CPA), can jeopardize legitimate vehicles with malicious content. For instance, attacker host vehicles can serve consumers with invalid information, which has dire consequences, including road accidents. In such a situation, trust in the content-providing vehicles brings a new challenge. On the other hand, ensuring privacy and preventing unauthorized access in vehicular (VNDN) is another challenge. Moreover, NDN’s pull-based content retrieval mechanism is inefficient for delivering emergency messages in VNDN. In this connection, our contribution is threefold. Unlike existing rule-based reputation evaluation, we propose a Machine Learning (ML)-based reputation evaluation mechanism that identifies CPA attackers and legitimate nodes. Based on ML evaluation results, vehicles accept or discard served content. Secondly, we exploit a decentralized blockchain system to ensure vehicles’ privacy by maintaining their information in a secure digital ledger. Finally, we improve the default routing mechanism of VNDN from pull to a push-based content dissemination using Publish-Subscribe (Pub-Sub) approach. We implemented and evaluated our ML-based classification model on a publicly accessible BurST-Asutralian dataset for Misbehavior Detection (BurST-ADMA). We used five (05) hybrid ML classifiers, including Logistic Regression, Decision Tree, K-Nearest Neighbors, Random Forest, and Gaussian Naive Bayes. The qualitative results indicate that Random Forest has achieved the highest average accuracy rate of 100%. 
Our proposed research offers the most accurate solution to detect CPA in VNDN for safe, secure, and reliable vehicle communication.
Article
Full-text available
The characteristics and performance of wireless sensor networks (WSNs) are the main reasons for their rapid expansion in various fields. However, these networks are extremely susceptible to multiple security assaults, including denial-of-service (DoS) attacks, which are among the most prevalent in these networks. This study sheds light on WSN restrictions, weaknesses, and security threats with a focus on DoS attacks. Recent techniques for DoS attack detection have been investigated thoroughly, highlighting their achievements and limitations. This provides valuable insight into the current state of recent research in this field. Accordingly, this study proposes a lightweight machine learning detection approach based on a decision tree (DT) algorithm with the Gini feature selection method to detect DoS attacks in WSNs. An enhanced version of the WSN-DS dataset, developed by the author, was used to train and test the proposed approach. The proposed approach has shown good performance by achieving an accuracy rate of 99.5% with minimum overhead compared to random forest (RF), extreme gradient boosting (XGBoost), and k-nearest neighbor (KNN) classifiers. It only takes 9.7%, 13%, and 2% of the processing time required by FR, XGBoost, and KNN respectively, which indicates that our proposed approach significantly outperforms these classifiers in terms of processing time. It is noteworthy that RF achieved an accuracy that was somewhat superior; however, the proposed approach greatly surpassed RF by taking only 9.7% of the RF processing time, which is an important factor in meeting WSN constraints.
Article
Full-text available
DARPA 1998 was one of the first Intrusion Detection datasets that was made publicly available. The KDD 1999 dataset was derived from DARPA 1998 to be used by researchers in developing machine learning (ML), classification and clustering algorithms with a security focus. DARPA 1998 has been criticised in literature due to raised concerns of problems in the dataset. Many researchers have accused KDD 1999 of having similar concerns but insufficient published evidence has been found. In this paper, we review the KDD 1999 generation process and present new proofs of existing inconsistencies in KDD 1999. We then present the process used to link some of the KDD 1999 (TELNET) records back to their origins in DARPA 1998 and discuss the interesting results and findings of this experiment.
Article
Full-text available
This paper proposes a novel approach for intrusion detection system based on sampling with Least Square Support Vector Machine (LS-SVM). Decision making is performed in two stages. In the first stage, the whole dataset is divided into some predetermined arbitrary subgroups. The proposed algorithm selects representative samples from these subgroups such that the samples reflect the entire dataset. An optimum allocation scheme is developed based on the variability of the observations within the subgroups. In the second stage, least square support vector machine (LS-SVM) is applied to the extracted samples to detect intrusions. We call the proposed algorithm as optimum allocation-based least square support vector machine (OA-LS-SVM) for IDS. To demonstrate the effectiveness of the proposed method, the experiments are carried out on KDD 99 database which is considered a de facto benchmark for evaluating the performance of intrusions detection algorithm. All binary-classes and multiclass are tested and our proposed approach obtains a realistic performance in terms of accuracy and efficiency. Finally a way out is also shown the usability of the proposed algorithm for incremental datasets.
Article
Full-text available
Wireless Sensor Networks (WSN) have become increasingly one of the hottest research areas in computer science due to their wide range of applications including critical military and civilian applications. Such applications have created various security threats, especially in unattended environments. To ensure the security and dependability of WSN services, an Intrusion Detection System (IDS) should be in place. This IDS has to be compatible with the characteristics of WSNs and capable of detecting the largest possible number of security threats. In this paper a specialized dataset for WSN is developed to help better detect and classify four types of Denial of Service (DoS) attacks: Blackhole, Grayhole, Flooding, and Scheduling attacks. This paper considers the use of LEACH protocol which is one of the most popular hierarchical routing protocols in WSNs. A scheme has been defined to collect data from Network Simulator 2 (NS-2) and then processed to produce 23 features. The collected dataset is called WSN-DS. Artificial Neural Network (ANN) has been trained on the dataset to detect and classify different DoS attacks. The results show that WSN-DS improved the ability of IDS to achieve higher classification accuracy rate. WEKA toolbox was used with holdout and 10-Fold Cross Validation methods. The best results were achieved with 10-Fold Cross Validation with one hidden layer. The classification accuracies of attacks were 92.8%, 99.4%, 92.2%, 75.6%, and 99.8% for Blackhole, Flooding, Scheduling, and Grayhole attacks, in addition to the normal case (without attacks), respectively.
Conference Paper
Full-text available
Wireless sensor networks (WSNs) are used in various fields where remote data collection is necessary, such as environment and habitat monitoring, military applications, smart homes, traffic control, and health monitoring etc. Since WSNs play a crucial role in various domains and the sensors are constrained by resources, they are vulnerable to different types of attacks. One of the main attack types that threaten WSNs is Denial-of-Service (DoS) attacks. DoS attacks can be carried out at various layers of the network architecture. In this paper, we review the DoS attacks at each layer of TCP/IP protocol stack. Among them we focus on the network layer attacks because they are more diverse than other layer attacks. We review a number of studies proposing machine learning solutions pertaining to network layer DoS attacks in WSNs. We also provide some comparative conclusions to aid researchers studying in this field.
Article
Full-text available
This work augments the Naïve Bayesian learning algorithm with a second training phase in an attempt to improve its classification accuracy. This is achieved by finding more accurate estimations of the needed probability terms. This approach helps in dealing with the problem of the lack of training data. Unlike many previous approaches that deal with this problem, the proposed method is an eager method in the sense that it does most of the work during training and, therefore, it does not increase classification time. It consists of two phases. In the first phase, the algorithm builds a classical Naïve Bayesian classifier. The second phase is a fine tuning phase. In this phase each training instance is classified, if it is misclassified, the probability values involved are fine tuned in such a way that increases the chances of correctly classifying this instance in the next round. Our results show significant improvement in the classification accuracy of many benchmark data sets, compared to the classical Naïve Bayesian, and two other methods that improve on the Naïve Bayesian algorithm.
Article
Full-text available
Optimization of resource utilization especially energy with higher Quality of Service (QoS) parameter such as packet delivery ratio etc, is an open and active research issue in wireless sensor networks (WSN). Existing methods such as direct transmission, multi-hop routing, minimum transmission energy and static clustering methods were not optimal for certain applications. LEACH protocol is built on the assumption of direct communication between cluster head and Sink nodes. Where as in many real time applications cluster head to cluster head communication is required. The Proposed work in this paper is an improvement in LEACH protocol by means of dynamic cluster head selection and routing from Cluster head to sink node through other cluster heads and their by increases the network life time. The proposed work gives better performance of network life time, low energy consumption compared to LEACH.
Data mining is the process of discovering knowledge by analyzing large volumes of data from various perspectives and summarizing them into useful information; it has become an essential component in various fields of human life. It is used to identify hidden patterns in large data sets. Classification techniques are supervised learning techniques that assign data items to predefined class labels. Classification is one of the most useful techniques in data mining for building models from an input data set; these models are commonly used to predict future data trends. In this paper, we work with different data mining applications and various classification algorithms. The algorithms are applied to different data sets to assess their efficiency and to improve performance through data preprocessing and feature selection, as well as to predict new class labels.
This work addresses the problem of overfitting the training data. We suggest smoothing the decision boundaries by eliminating border instances from the training set before training Artificial Neural Networks (ANNs). This is achieved by using a variety of instance reduction techniques. A large number of experiments were performed using 21 benchmark data sets from the UCI Machine Learning Repository, both with and without the introduction of noise in the data sets. Our empirical results show that using a noise filtering algorithm to filter out border instances before training an ANN not only improves classification accuracy but also speeds up training by reducing the number of training epochs. The effectiveness of the approach is more evident when the training data contains noisy instances.
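As an illustration of filtering border and noisy instances before training, the sketch below uses Wilson's Edited Nearest Neighbour rule. The abstract does not say which instance reduction techniques were used, so the choice of this rule, the Euclidean distance, the default `k=3`, and the function name are all assumptions.

```python
import math
from collections import Counter


def edited_nearest_neighbours(X, y, k=3):
    """Keep only instances whose label agrees with the majority of their k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Indices of the k closest other instances (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        majority = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if majority == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

The filtered `X` and `y` would then be passed to the ANN training routine in place of the full training set. An odd `k` avoids ties in two-class problems.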
Nowadays, Wireless Sensor Network (WSN) services are applied in many civilian and military scenarios. This has given rise to various security threats, especially in unattended environments. Low-Energy Adaptive Clustering Hierarchy (LEACH) is one of the most popular hierarchical routing protocols in WSNs; it consumes limited energy and thus increases network lifetime. The security of data transfer over WSNs is a challenging issue, especially in the presence of Denial of Service (DoS) attacks. The contribution of this paper is twofold. First, it presents an experimental analysis of the impact of existing DoS attacks on the performance of the LEACH protocol, including the Flooding, Black Hole, and Gray Hole attacks. Second, it introduces a new DoS attack on LEACH called the Scheduling attack. To check the reliable operation of the LEACH routing protocol, all of the attacks mentioned above are implemented, and their impact on LEACH performance is evaluated in terms of packet delivery ratio, network lifetime, and energy consumption.