Clustering Enabled Classification using Ensemble Feature Selection for Intrusion Detection

Fadi Salo, MohammadNoor Injadat, Abdallah Moubayed, Ali Bou Nassif, Aleksander Essex
Department of Electrical and Computer Engineering, The University of Western Ontario, London, ON, Canada
Email: {fsalo, minjadat, amoubaye, aessex}@uwo.ca
Department of Electrical and Computer Engineering, University of Sharjah, Sharjah, UAE
Email: anassif@sharjah.ac.ae
Abstract—Machine learning has been leveraged to increase the
effectiveness of intrusion detection systems (IDSs). The focus of
this approach, however, has largely been on detecting known attack
patterns based on outdated datasets. In this paper, we propose
an ensemble feature selection method along with an anomaly
detection method that combines unsupervised and supervised
machine learning techniques to classify network traffic to identify
previously unseen attack patterns. To that end, three different
feature selection techniques are used as part of an ensemble
model that selects 8 common features. Moreover, k-means
clustering is used to first partition the training instances into k clusters
using the Manhattan distance. A classification model is then built
based on the resulting clusters, which represent a density region
of normal or anomaly instances. This in turn helps determine
the effectiveness of the clustering in detecting unknown attack
patterns within the data. The performance of our classifier is
evaluated using the Kyoto dataset, which was collected between
2006 and 2015. To our knowledge, no previous work proposed
such a framework that combines unsupervised and supervised
machine learning approaches using this dataset. Experimental
results show the effectiveness of the proposed framework in
detecting previously unseen attack patterns compared to the
traditional classification approach.
Index Terms—Network anomaly detection, Ensemble feature
selection, k-means clustering, Classification, Kyoto dataset.
I. INTRODUCTION
Two performance indicators often used to evaluate the
effectiveness of intrusion detection systems (IDSs) [1] are
precision and stability [2], [3]. Past research has focused on
rule-based systems and statistical approaches [4]; however, the
performance of such approaches suffers on larger datasets. Data
mining and machine learning approaches have been proposed
as one promising solution to this problem [5].
IDSs are typically categorized as either anomaly-based or
misuse-based, with each category having merits and
limitations. Misuse-based methods compare the data against a
predefined set of rules or patterns to detect network attacks.
However, such approaches are not adaptable and are limited in
their ability to detect previously unseen attack types. On the
other hand, anomaly-based methods collect data that represent
normal behavior and build familiarity models, and actions
deviating from the model are labeled as suspicious/anomalous
[6].
Despite the promising improvements achieved by previous
work, intrusion detection is still a challenging problem. This is
exacerbated by the high volume of traffic data, a continuously
evolving environment, and an abundance of available features
[7]. For example, a set of irrelevant, redundant, or highly
correlated features can be found in high dimensional datasets,
which can negatively impact the accuracy and performance of
IDSs. Choosing an appropriate subset of features is essential
to improving the detection model [8].
In this paper, we propose an effective intrusion detection
framework based on ensemble feature selection, clustering,
and supervised machine learning classifiers that can detect
previously unseen attack patterns with high accuracy. Toward
the development of an accurate and robust anomaly detection
methodology, we used several graphical and statistical ex-
ploratory data analytics techniques on our chosen dataset. The
result of our analysis led us to select Support Vector Machine
with Gaussian kernels (SVM-RBF), k-nearest neighbors (k-
NN), Random Forest (RF), and quadratic discriminant analysis
(QDA) due to the non-linear nature of the considered dataset.
The performance of our approach is evaluated and com-
pared to the traditional classification approach by conducting
different experiments with the Kyoto 2006+ dataset, which was
built during a 9-year period of network traffic collection
from honeypots at Kyoto University [9]. To our knowledge,
no previous work proposed such a framework that combines
unsupervised and supervised machine learning approaches
using this dataset.
This paper combines homogeneous ensemble feature
selection using three feature selection algorithms
(correlation-based, information gain-based, and
significance-based) with k-means clustering and different
classification techniques to improve the IDS's ability to detect
and identify unknown patterns. The feasibility and efficiency of
the considered methods are evaluated using various metrics such
as accuracy, precision, recall, F-measure, and false alarm rate.
The main contributions of this paper include:
• An investigation of the behavior and characteristics
of the Kyoto dataset using a graphical and statistical
exploratory data analytics approach.
• A proposal for a homogeneous ensemble feature selection
approach using three base feature selection algorithms.
• The first study, to our knowledge, of the efficiency and
effectiveness of a clustering-based classification framework
for anomaly detection using the Kyoto 2006+ dataset.
The remainder of this paper is organized as follows. Section
II summarizes some of the related work. Section III gives a
brief overview of considered algorithms. Section IV presents
the proposed framework. Section V discusses the research
methodology and the experimental results. Finally, Section VI
concludes the paper and provides future research directions.
II. RELATED WORK
Researchers have typically treated intrusion detection as a
classification problem. To that end, various supervised ma-
chine learning classifiers such as support vector machines
(SVM) [10], decision trees [11], k-nearest neighbor (k-NN)
[12], and naive Bayes have been proposed. See, e.g., the
review of Tsai et al. [13]. Further promising results were
obtained by proposing novel data mining-based approaches
(cf. Wu and Banzhaf [14]). Recently, hybrid optimization-
based models have been proposed to improve the performance
of intrusion detection systems. For instance, Chung and Wahid
[15] proposed a hybrid approach that includes feature se-
lection and classification with simplified swarm optimization
(SSO). The performance of SSO was further improved by
using weighted local search (WLS) to obtain better solutions
from the neighborhood [15]. The authors reported intrusion
detection accuracy up to 93.3%. Similarly, Kuang et al. [16]
proposed a hybrid method that combined genetic algorithm
(GA) and multi-layered SVM with kernel principal compo-
nent analysis (KPCA) to enhance the detection performance.
Another technique by Zhang et al. [17] combined misuse and
anomaly detection using random forests. In contrast, a novel
particle swarm optimization-based algorithm, Catfish-BPSO,
was proposed in [18] to select features and enhance the model
performance.
III. THEORETICAL ASPECTS OF THE TECHNIQUES
Starting with the clustering, the k-means algorithm was
chosen to group the instances into two clusters. The algorithm
was chosen specifically due to its simplicity, effectiveness
in dealing with network traffic data [19], and flexibility in
offering the choice of the desired number of clusters. In brief,
k-means is an unsupervised machine learning algorithm that
groups instances into k clusters using a particular distance
metric such as Euclidean, Manhattan, or Mahalanobis distance
[20]. This distance metric is used to determine the proximity
of each instance to the cluster centroid. Upon convergence,
the output of the algorithm is the centroid of each of the k
clusters and the cluster label of each instance.
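The assignment and update steps described above can be sketched in a few lines. This is a minimal illustration in Python (the paper's implementation used MATLAB), assuming the Manhattan distance and a component-wise median as the centroid update, which is a common convention for city-block clustering but is not stated in the paper. The function name `kmeans_manhattan`, the `init` parameter, and the toy points are hypothetical.

```python
import random
from statistics import median


def manhattan(a, b):
    """L1 (city-block) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans_manhattan(points, k, init=None, iters=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k randomly chosen points.
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(c) for c in start]
    labels = None
    for _ in range(iters):
        # Assignment step: each instance joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: manhattan(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step (assumption): component-wise median of the members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [median(col) for col in zip(*members)]
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans_manhattan(points, k=2, init=[(0, 0), (10, 10)])
```

With k = 2, as used in this work, the two returned labels play the role of the candidate normal and anomaly regions.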
For classification, four different supervised machine
learning algorithms are considered in this work:
support vector machines (SVM), k-nearest neighbors (k-NN),
random forests (RF), and quadratic discriminant analysis
(QDA). These algorithms are chosen due to their ability to
deal with non-linear datasets. We briefly recall some specifics.
SVM is a supervised machine learning classification
algorithm that tries to determine the maximum-margin separating
hyperplane between two classes in order to distinguish the
positive class from the negative class [21]. The output of the
SVM with Gaussian kernel (SVM-RBF) is [22]:
$$f(x) = w^{T}\Phi(x) + b \tag{1}$$
where $\Phi(x)$ represents the kernel used. The goal is to
determine the weight vector $w$ and intercept $b$ that minimize
the following objective function:
$$\min_{w,b}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{m} \Big[ y_i \cdot \mathrm{cost}_1\big(f(x_i)\big) + (1 - y_i) \cdot \mathrm{cost}_0\big(f(x_i)\big) \Big] \tag{2}$$
where $C$ is a regularization parameter that penalizes
incorrectly classified instances and $\mathrm{cost}_i$ is the
squared error over the training dataset.
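The paper names the Gaussian (RBF) kernel without writing it out. As a hedged illustration, the standard form is K(x, z) = exp(-gamma * ||x - z||^2); the sketch below evaluates it in Python. The function name `rbf_kernel` and the default `gamma` value are assumptions for illustration, not settings reported by the authors.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel value for two equal-length vectors.

    gamma is a hypothetical width parameter; the paper does not report it.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel((0.0, 0.0), (0.0, 0.0))   # identical points
near = rbf_kernel((0.0, 0.0), (1.0, 0.0))   # unit distance apart
far = rbf_kernel((0.0, 0.0), (3.0, 4.0))    # distance 5 apart
```

The kernel value decays from 1 toward 0 as two instances move apart, which is what lets the resulting decision boundary bend around non-linearly separable traffic.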
k-NN is a simple classification algorithm that determines
the class of an instance based on the majority class of its k
nearest neighboring points. This is done by first evaluating
the distance from the data point to all other points within
the training dataset. Different distance measures can be used,
such as the Euclidean, Manhattan, or Mahalanobis distance
[20]. After determining the distance, the k nearest points are
identified and a majority voting-based decision is made on the
class of the considered data point [23].
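The distance ranking and majority vote described above can be sketched as follows. This is a minimal Python illustration assuming the Euclidean distance; the function name `knn_predict` and the toy training points and labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote of its k nearest points."""
    # Rank every training instance by its Euclidean distance to the query.
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Count the labels of the k closest instances and return the most common.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]
low = knn_predict(train_X, train_y, (0.5, 0.5), k=3)
high = knn_predict(train_X, train_y, (5.5, 5.5), k=3)
```

An odd k such as 3 or 5, the values evaluated later in this paper, avoids tied votes in a two-class problem.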
Random forests are an ensemble learning classifier that
combines several decision trees to predict the class [24].
Each tree is built independently on a random sample of the
training data, and the trees' results are combined using a
majority vote. The RF classifier sends any new incoming data
point to each of its trees and chooses the class predicted by
the most trees.
Finally, discriminant analysis is a statistical technique that
tries to find a group of prediction equations based on inde-
pendent variables [25]. This technique can be used for one
of two objectives, either to determine a prediction equation
that can be used to classify new input points, or to interpret
the equation to get a better understanding of the relationship
that exists between the variables [25]. A quadratic kernel is
one which assumes a quadratic relationship exists between the
independent variables.
IV. PROPOSED FRAMEWORK
This section presents an overview of the clustering-enabled
classification framework for intrusion detection. The goal is
to evaluate the efficiency of the framework in detecting
unseen/unknown patterns. This is done in two ways: first, by
comparing the result of the clustering to the true
label of each instance, and second, by using a separate data
sample to act as the testing dataset for unseen data instances.
The framework is as follows: the first step consists of se-
lecting appropriate features. To that end, this work proposes
an ensemble feature selection technique that combines three
different feature selection methods, namely information gain,
correlation, and significance methods. The features chosen are
the features common among the three methods. Following
feature selection, graphical and statistical data analytics is
applied to get a better understanding of the behavior of the
selected features. Then, the training dataset is clustered using
the k-means algorithm, which we chose due to its simplicity
and effectiveness in dealing with network traffic data [19].
This is followed by building the final classification model
using the selected classification algorithms. The choice of
these algorithms is dependent on the insights gained from the
exploratory data analytics step. The model is then applied to
the testing dataset to evaluate the efficiency of the clustering
in detecting previously unseen patterns. The overall flow of
the proposed framework is illustrated in Fig. 1.
[Figure: flow diagram with the stages Datasets, Feature Selection (Information Gain, Correlation, Significance), Selected Features, Training and Testing sets, Building Model (Clustering with k-means, then Classification with SVM/KNN/RF/QDA), Final Model, Testing Model, and Result (Normal or Attack).]
Fig. 1. Proposed framework
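The ensemble rule described in this section, keeping only the features chosen by all three selection methods, amounts to a set intersection. The Python sketch below illustrates it under stated assumptions: the function name `ensemble_select` is hypothetical, and the feature names are placeholders rather than the actual Kyoto 2006+ features selected in the paper.

```python
def ensemble_select(*selections):
    """Return the features present in every selection.

    The order of the first selection is preserved so the result is stable.
    """
    common = set(selections[0]).intersection(*selections[1:])
    return [feature for feature in selections[0] if feature in common]


# Placeholder outputs of the three base selectors (illustrative only).
by_information_gain = ["f1", "f2", "f3", "f5"]
by_correlation = ["f2", "f3", "f4", "f5"]
by_significance = ["f5", "f3", "f2", "f6", "f7"]

selected = ensemble_select(by_information_gain, by_correlation, by_significance)
```

Taking the intersection is the strictest way to combine selectors: a feature survives only if every method considers it relevant.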
V. EXPERIMENTAL SETUP AND RESULT DISCUSSION
A. Dataset Description
We used the Kyoto 2006+ dataset to evaluate our proposed
framework. This dataset was collected from honeypots by
Kyoto University over the 9-year period from Nov. 2006
to Dec. 2015. It consists of 1 million records, each containing
24 features [9]. A random subset of approximately 300,000
records was chosen to form our experimental dataset. This
was further divided into training and testing datasets, using a
60/40 split. The training dataset consisted of 178,479 records
with 92,729 normal and 85,750 attack records. The testing
dataset consisted of 118,986 records with 61,765 normal and
57,221 attack records.
B. Experimental setup and Data Preprocessing
The proposed techniques were implemented using MATLAB
2018a. The selected dataset was transformed from its
original format into a new dataset consisting of 8 features, as
illustrated by the Venn diagram shown in Fig. 2.
[Figure: Venn diagram of the feature sets selected by Correlation (11), Information Gain (11), and Significance (20).]
Fig. 2. Ensemble Feature Selection Output
Outlier removal was performed using Inter-Quartile Range
(IQR) to remove any redundant or noisy data points. As most
of the classifiers do not accept categorical features [26], data
mapping was used to transform non-numeric feature values
into numeric ones (named categorical in MATLAB).
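The two preprocessing steps above can be sketched as follows. This is a Python illustration rather than the MATLAB code used in the paper. It assumes the common 1.5 x IQR fences, a multiplier the paper does not state, and the function names, the column name `x`, and the sample values are all hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, factor=1.5):
    """Lower and upper fences from the inter-quartile range.

    factor=1.5 is an assumed convention, not a value from the paper.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def remove_outliers(rows, column, factor=1.5):
    """Keep only the rows whose `column` value lies within the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows], factor)
    return [row for row in rows if low <= row[column] <= high]


def encode_categories(values):
    """Map each distinct non-numeric value to an integer code."""
    mapping = {}
    codes = [mapping.setdefault(v, len(mapping)) for v in values]
    return codes, mapping


rows = [{"x": v} for v in [1, 2, 3, 4, 5, 100]]
kept = remove_outliers(rows, "x")
codes, mapping = encode_categories(["tcp", "udp", "tcp", "icmp"])
```

Codes are assigned in order of first appearance, so the mapping must be built on the training data and reused unchanged on the testing data.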
C. Prediction Performance Measures
To evaluate and compare prediction models quantitatively,
the following measurements were utilized:
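$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$
$$\text{F-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$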
where TP is the number of true positives, TN is the number of
true negatives, FP is the number of false positives, and FN is
the number of false negatives [27].
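Equations (3) to (6) translate directly into code. The Python sketch below computes them from confusion-matrix counts. The false alarm rate (FAR) reported in the result tables is not defined in the text, so its formula here, FP / (FP + TN), is an assumption based on the usual definition; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the measures of Eqs. (3)-(6) from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Assumption: FAR taken as the false positive rate, FP / (FP + TN).
    far = fp / (fp + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "far": far,
    }


# Illustrative counts only, not results from the paper.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```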
Fig. 3. Probability Density Function of Normal and Attack Traffic
Fig. 4. Number of Same Service to Same Destination ID vs Same Service
Rate
D. Results Discussion
The goal of this work is to evaluate the effectiveness of
the clustering method in detecting unseen/unknown patterns
within the dataset. However, a crucial initial step to better
understand the dataset is to study its behavior to gain insights
into it. To that end, exploratory data analytics is applied
by plotting the probability density function of two features,
namely same_srv_rate (same service rate) and
Dst_host_srv_count (number of connections of the same service
type to the same destination IP), for both normal and attack
traffic, as shown in Fig. 3. The service rate refers to the
percentage of connections that have the same service type
(e.g., http, telnet). It can be seen that normal traffic tends
to have a higher service rate and a higher number of
same-service connections to the same destination IP, whereas
attack traffic shows the opposite trend. These statistical
trends give us initial insights into the behavior of normal and
attack traffic that can be helpful in future prediction.
Fig. 4 plots the two highest ranked numeric features against
each other. It plots the number of same service to the same
destination against the same service rate. It is clear that the
dataset is not linearly separable. This provides further insights
and justifies the choice of using SVM-RBF, k-NN, RF, and
QDA methods as they can handle non-linear data.
Table I shows the performance of the considered classifi-
cation algorithms for both the training and testing datasets
without clustering. These results will be used as a benchmark
to gauge the effectiveness of the clustering algorithm in
detecting previously unseen patterns.
TABLE I
PERFORMANCE RESULTS OF THE THREE CLASSIFIERS WITHOUT CLUSTERING
                         Training                                   Testing
Classifier    Acc(%)  Precision  Recall  FAR    F-measure   Acc(%)  Precision  Recall  FAR    F-measure
SVM-RBF       96.88   0.982      0.958   0.018  0.97        80.03   0.767      0.883   0.288  0.821
k-NN (k=3)    98.03   0.985      0.977   0.016  0.981       86.95   0.882      0.864   0.124  0.873
k-NN (k=5)    97.65   0.982      0.972   0.018  0.977       88.33   0.898      0.875   0.107  0.886
RF            98.52   0.994      0.978   0.006  0.986       57.79   0.911      0.207   0.021  0.337
QDA           87.1    0.926      0.817   0.071  0.868       87.02   0.936      0.805   0.059  0.866
TABLE II
PERFORMANCE RESULTS OF THE THREE CLASSIFIERS WITH CLUSTERING
                         Training                                   Testing
Classifier    Acc(%)  Precision  Recall  FAR    F-measure   Acc(%)  Precision  Recall  FAR    F-measure
SVM-RBF       98.24   0.988      0.984   0.02   0.986       76.89   0.716      0.92    0.393  0.805
k-NN (k=3)    98.96   0.991      0.991   0.013  0.992       82.28   0.783      0.911   0.272  0.842
k-NN (k=5)    98.56   0.989      0.988   0.018  0.988       81.33   0.768      0.918   0.299  0.836
RF            99.98   0.999      0.999   0.001  0.999       79.06   0.741      0.916   0.344  0.82
QDA           93.49   0.973      0.92    0.041  0.946       81.63   0.784      0.892   0.265  0.834
Table II, on the other hand, shows the performance of
the different classification algorithms with clustering. Several
observations can be made. The first is that almost all algorithms
(with the exception of QDA) achieved good training accuracy,
as shown in Table I. However, this did not necessarily
translate into testing accuracy. This shows that algorithms
such as RF are not well suited to anomaly detection, as RF only
had a testing accuracy of approximately 58%. The second
observation is that using clustering to detect anomalies is
effective. This is based on the fact that the testing accuracy
of the classifiers after clustering is close to that of the non-
clustering case. This shows that clustering is able to detect
previously unseen patterns effectively. This is further high-
lighted in Figs. 5 and 6 which show the training and testing
accuracy of the different classification algorithms with and
without clustering. The results in Fig. 5 are expected, since
the model was trained using this dataset and hence will have
a high accuracy. However, the results in Fig. 6 emphasize the
efficiency of the proposed clustering-enabled classification. It
can be seen that the difference in testing accuracy between the
clustering and the non-clustering cases is less than 10% for
most classification algorithms. This means that the clustering
was able to predict previously unseen patterns with a relatively
high accuracy. By comparison, the results reported in [28]
showed an accuracy of only between 60% and 77%, whereas the
proposed work achieved higher testing accuracy, with the
lowest being approximately 77%, thus illustrating
the effectiveness of the proposed framework. Moreover, it
can be concluded that k-NN with 3 neighbors has the best
performance given that it achieved high training accuracy and
had the smallest difference in testing accuracy between the
non-clustering and the clustering cases.
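The testing-accuracy differences discussed above can be recomputed from the Acc(%) testing columns of Tables I and II. The short Python sketch below does so. The variable names are hypothetical, and a positive gap means the non-clustering case scored higher.

```python
# Testing accuracy (%) copied from Tables I and II.
without_clustering = {
    "SVM-RBF": 80.03, "k-NN (k=3)": 86.95, "k-NN (k=5)": 88.33,
    "RF": 57.79, "QDA": 87.02,
}
with_clustering = {
    "SVM-RBF": 76.89, "k-NN (k=3)": 82.28, "k-NN (k=5)": 81.33,
    "RF": 79.06, "QDA": 81.63,
}

# Gap in percentage points: non-clustering minus clustering.
gaps = {
    name: round(without_clustering[name] - with_clustering[name], 2)
    for name in without_clustering
}

# Classifiers whose absolute gap is below 10 percentage points.
within_ten = [name for name, gap in gaps.items() if abs(gap) < 10]
```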
[Figure: bar chart of training accuracy (%), without / with clustering: SVM-RBF 96.88 / 98.24; k-NN (k=3) 98.03 / 98.96; k-NN (k=5) 97.65 / 98.56; RF 98.52 / 99.98; QDA 87.10 / 93.49.]
Fig. 5. Overall Accuracy of Training Dataset
[Figure: bar chart of testing accuracy (%), without / with clustering: SVM-RBF 80.03 / 76.89; k-NN (k=3) 86.95 / 82.28; k-NN (k=5) 88.33 / 81.33; RF 57.79 / 79.05; QDA 87.02 / 81.63.]
Fig. 6. Overall Accuracy of Testing Dataset
VI. CONCLUSIONS
In this paper, an efficient intrusion detection framework
based on homogeneous ensemble feature selection, clustering,
and supervised machine learning classifiers was proposed. This
was done in order to test the efficiency of the clustering
algorithm in detecting previously unseen attack patterns for
intrusion detection. The techniques considered were chosen
based on the nature of the selected dataset which was investi-
gated using different graphical and statistical exploratory data
analytics techniques such as the probability density function.
The performance was evaluated and compared by conducting
different experiments with the Kyoto 2006+ dataset, which was
built over 9 years of real traffic data collection (between
Nov. 2006 and Dec. 2015) from diverse types of honeypots at
Kyoto University. To the best of our knowledge, no previous
work proposed such a framework using this dataset. To explore
the dataset, a homogeneous ensemble feature selection mech-
anism using three feature selection algorithms (correlation-
based, information gain-based, and significance-based) was
applied to extract 8 representative features. This was followed
by applying different graphical and statistical exploratory data
analytics techniques to better understand the behavior of the
features. The results of this data analysis showed that the
dataset is highly non-linear, which motivated the choice of
the considered supervised classification algorithms. The new
dataset was then clustered using the k-means algorithm, and a
classification model was built using different classification
techniques to improve the IDS’s ability to detect and identify
unknown patterns. Experimental results showed that k-means
clustering was indeed efficient in detecting previously unseen
patterns. This was highlighted by the small difference in
testing accuracy between the clustering and the non-clustering
cases which did not exceed the 10% range for most classifiers.
Furthermore, it was also shown that the k-NN algorithm with
k = 3 had the best performance, as it achieved high training
accuracy and had the smallest testing accuracy difference. In
order to further improve the performance of the proposed
approach, we plan to develop an adaptive model that clusters
any new attacks with existing ones. This in turn will provide
a more robust and dynamic intrusion detection system and
improve its security.
REFERENCES
[1] R. Zuech, T. M. Khoshgoftaar, and R. Wald, “Intrusion detection
and big heterogeneous data: a survey,” Journal of Big Data, vol. 2,
no. 1, p. 3, Feb 2015. [Online]. Available:
https://doi.org/10.1186/s40537-015-0013-4
[2] L. de Sá Silva, A. C. F. dos Santos, T. D. Mancilha, J. D. S. da Silva,
and A. Montes, “Detecting attack signatures in the real network traffic
with annida.” Elsevier, 2008, vol. 34, no. 4, pp. 2326–2333.
[3] A. Patcha and J.-M. Park, “An overview of anomaly detection tech-
niques: Existing solutions and latest technological trends,” Computer
networks, vol. 51, no. 12, pp. 3448–3470, 2007.
[4] S. Mukkamala, G. Janoski, and A. Sung, “Intrusion detection using
neural networks and support vector machines,” vol. 2, pp. 1702–1707,
2002.
[5] S.-Y. Wu and E. Yen, “Data mining-based intrusion detectors,” Expert
Systems with Applications, vol. 36, no. 3, pp. 5605–5612, 2009.
[6] C. Kruegel, F. Valeur, and G. Vigna, “Intrusion detection and correlation:
challenges and solutions,” vol. 14, 2004.
[7] S. Suthaharan, “Big data classification: Problems and challenges in net-
work intrusion prediction with machine learning,” ACM SIGMETRICS
Performance Evaluation Review, vol. 41, no. 4, pp. 70–73, 2014.
[8] J. Zhang and M. Zulkernine, “Anomaly based network intrusion de-
tection with unsupervised outlier detection,” in Communications, 2006.
ICC’06. IEEE International Conference on, vol. 5. IEEE, 2006, pp.
2388–2393.
[9] M. A. Ambusaidi, X. He, P. Nanda, and Z. Tan, “Building an intrusion
detection system using a filter-based feature selection algorithm,” IEEE
Transactions on Computers, vol. 65, no. 10, pp. 2986–2998, Oct 2016.
[10] A. S. Eesa, Z. Orman, and A. M. A. Brifcani, “A novel feature-selection
approach based on the cuttlefish optimization algorithm for intrusion
detection systems,” Expert Systems with Applications, vol. 42, no. 5,
pp. 2670–2679, 2015.
[11] W. Li, P. Yi, Y. Wu, L. Pan, and J. Li, “A new intrusion detection system
based on knn classification algorithm in wireless sensor network,”
Journal of Electrical and Computer Engineering, vol. 2014, 2014.
[12] S. Aljawarneh, M. Aldwairi, and M. B. Yassein, “Anomaly-based in-
trusion detection system through feature selection analysis and building
hybrid efficient model,” Journal of Computational Science, 2017.
[13] C.-F. Tsai, Y.-F. Hsu, C.-Y. Lin, and W.-Y. Lin, “Intrusion detection by
machine learning: A review,” Expert Systems with Applications, vol. 36,
no. 10, pp. 11 994–12 000, 2009.
[14] S. X. Wu and W. Banzhaf, “The use of computational intelligence in
intrusion detection systems: A review,” Applied soft computing, vol. 10,
no. 1, pp. 1–35, 2010.
[15] Y. Y. Chung and N. Wahid, “A hybrid network intrusion detection system
using simplified swarm optimization (sso),” Applied Soft Computing,
vol. 12, no. 9, pp. 3014–3022, 2012.
[16] F. Kuang, W. Xu, and S. Zhang, “A novel hybrid kpca and svm with
ga model for intrusion detection,” Applied Soft Computing, vol. 18, pp.
178–184, 2014.
[17] J. Zhang, M. Zulkernine, and A. Haque, “Random-forests-based network
intrusion detection systems,” IEEE Transactions on Systems, Man, and
Cybernetics, Part C (Applications and Reviews), vol. 38, no. 5, pp. 649–
659, 2008.
[18] A. J. Malik and F. A. Khan, “A hybrid technique using multi-objective
particle swarm optimization and random forests for probe attacks
detection in a network,” in Systems, Man, and Cybernetics (SMC), 2013
IEEE International Conference on. IEEE, 2013, pp. 2473–2478.
[19] Y. Liu, W. Li, and Y.-C. Li, “Network traffic classification using k-means
clustering,” in Computer and Computational Sciences, 2007. IMSCCS
2007. Second International Multi-Symposiums on. IEEE, 2007, pp.
360–365.
[20] A. Kind, M. P. Stoecklin, and X. Dimitropoulos, “Histogram-based
traffic anomaly detection,” IEEE Transactions on Network and Service
Management, vol. 6, no. 2, pp. 110–121, 2009.
[21] I. S. Thaseen and C. A. Kumar, “Intrusion detection model using fusion
of chi-square feature selection and multi class svm,” Journal of King
Saud University-Computer and Information Sciences, vol. 29, no. 4, pp.
462–472, 2017.
[22] H. Bostani and M. Sheikhan, “Modification of supervised opf-based
intrusion detection systems using unsupervised learning and social
network concept,” Pattern Recognition, vol. 62, pp. 56–72, 2017.
[23] W. Meng, W. Li, and L.-F. Kwok, “Design of intelligent knn-based alarm
filter using knowledge-based alert verification in intrusion detection,”
Security and Communication Networks, vol. 8, no. 18, pp. 3883–3895,
2015.
[24] M. Injadat, F. Salo, and A. B. Nassif, “Data mining techniques in
social media: A survey,” Neurocomputing, vol. 214, pp. 654 – 670,
2016. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S092523121630683X
[25] N. S. Software, “Chapter 440: discriminant analysis,” Available at:
https://ncss-wpengine.netdna- ssl.com/wp-content/themes/ncss/pdf/
Procedures/NCSS/Discriminant Analysis.pdf.
[26] M. Salem and U. Buehler, “Mining techniques in network security to
enhance intrusion detection systems,” arXiv preprint arXiv:1212.2414,
2012.
[27] M. H. Tang, C. Ching, S. Poon, S. S. Chan, W. Ng, M. Lam, C. Wong,
R. Pao, A. Lau, and T. W. Mak, “Evaluation of three rapid oral fluid test
devices on the screening of multiple drugs of abuse including ketamine,”
Forensic science international, 2018.
[28] F. Hosseinpour, P. V. Amoli, F. Farahnakian, J. Plosila, and
T. Hämäläinen, “Artificial immune system based intrusion detection:
innate immunity using an unsupervised learning approach,” International
Journal of Digital Content Technology and its Applications, vol. 8, no. 5,
p. 1, 2014.
... • NSL-KDD: Kyoto University gathered network traffic records for this dataset using honeypots, darknet sensors, email servers, web crawlers, and other network security protocols. Each record has 24 statistical attributes, 14 of which are extracted from the KDD Cup'99 dataset and ten new ones (Salo et al., 2019). ...
Article
Full-text available
The advancement of communication and internet technology has brought risks to network security. Thus, Intrusion Detection Systems (IDS) was developed to combat malicious network attacks. However, IDSs still struggle with accuracy, false alarms, and detecting new intrusions. Therefore, organizations are using Machine Learning (ML) and Deep Learning (DL) algorithms in IDS for more accurate attack detection. This paper provides an overview of IDS, including its classes and methods, the detected attacks as well as the dataset, metrics, and performance indicators used. A thorough examination of recent publications on IDS-based solutions is conducted, evaluating their strengths and weaknesses, as well as a discussion of their potential implications, research challenges, and new trends. We believe that this comprehensive review paper covers the most recent advances and developments in ML and DL-based IDS, and also facilitates future research into the potential of emerging Artificial Intelligence (AI) to address the growing complexity of cybersecurity challenges
... The primary difficulty in anomaly detection lies in recognizing patterns within data that fail to anticipate expected behavior [131]- [133]. Anomaly detection is used in various applications such as cybersecurity, network intrusion detection, detecting unusual video activity, fault detection, streaming, and hyperspectral imaging [134]- [136]. There are various techniques for anomaly detection. ...
Article
Full-text available
Given the continually rising frequency of cyberattacks, the adoption of artificial intelligence methods, particularly Machine Learning (ML), Deep Learning (DL), and Reinforcement Learning (RL), has become essential in the realm of cybersecurity. These techniques have proven to be effective in detecting and mitigating cyberattacks, which can cause significant harm to individuals, organizations, and even countries. Machine learning algorithms use statistical methods to identify patterns and anomalies in large datasets, enabling security analysts to detect previously unknown threats. Deep learning, a subfield of ML, has shown great potential in improving the accuracy and efficiency of cybersecurity systems, particularly in image and speech recognition. On the other hand, RL is again a subfield of machine learning that trains algorithms to learn through trial and error, making it particularly effective in dynamic environments. We also evaluated the usage of ChatGPT-like AI tools in cyber-related problem domains on both sides, positive and negative. This article provides an overview of how ML, DL, and RL are applied in cybersecurity, including their usage in malware detection, intrusion detection, vulnerability assessment, and other areas. The state-of-the-art studies using ML, DL, and RL models are evaluated in each section based on the main idea, techniques, and important findings. It also discusses these techniques’ challenges and limitations, including data quality, interpretability, and adversarial attacks. Overall, the use of ML, DL, and RL in cybersecurity holds great promise for improving the effectiveness of security systems and enhancing our ability to protect against cyberattacks. However, it is essential to continue developing and refining these techniques to address the ever-evolving nature of cyber threats. 
Besides, some promising solutions that rely on machine learning, deep learning, and reinforcement learning are susceptible to adversarial attacks, underscoring the importance of factoring in this vulnerability when devising countermeasures against sophisticated cyber threats. We also concluded that ChatGPT can be a valuable tool for cybersecurity, but it should be noted that ChatGPT-like tools can also be manipulated to threaten the integrity, confidentiality, and availability of data.
... RF is a classifier that combines decision trees trained on different subsets of the data and averages their outputs to improve prediction performance, as depicted in Fig. 2. Accuracy increases as the number of decision trees in the forest increases, and error-related issues are reduced. RF uses several decision trees as part of its ensemble learning approach [18] and is formulated in (3) as: ...
Conference Paper
With the growing use of online payment systems, the need for strong security measures to defend against fraudulent activity has become critical. Anomaly detection approaches based on machine learning algorithms have emerged as efficient solutions for spotting aberrant patterns and detecting fraudulent transactions in online payment systems. They offer efficient and effective online payment monitoring, protecting against fraudulent activity. In the present study, the application of machine learning techniques for anomaly detection in online payment systems is investigated. In conclusion, the results indicate that four models, namely Logistic Regression, Decision Tree, Random Forest, and Extreme Gradient Boosting (XGB) Classifier, can be used for anomaly detection in online payments. Among the four models, the XGB Classifier provided the highest model accuracy.
Article
In the dynamic landscape of healthcare, personalized and holistic patient care is becoming increasingly vital. The "HEALTHCARE SYSTEM FOR INDIVIDUAL PRAKRITI" project offers an innovative approach to understanding patients' individual constitution or "Prakriti" based on Ayurvedic principles. This project aims to enhance healthcare outcomes by integrating traditional Ayurvedic knowledge with modern technology. The "HEALTHCARE SYSTEM FOR INDIVIDUAL PRAKRITI" project represents a harmonious blend of traditional wisdom and contemporary technology, with the goal of advancing the quality of healthcare and promoting a holistic understanding of patient health. This abstract provides an overview of the project's purpose, combining Ayurvedic principles with modern healthcare practices to benefit both patients and healthcare providers.
Article
Machine learning algorithms present a robust alternative for building Intrusion Detection Systems due to their ability to recognize attacks in computer network traffic by recognizing patterns in large amounts of data. Typically, classifiers are trained for this task. Together, ensemble learning algorithms have increased the performance of these detectors, reducing classification errors and allowing computer networks to be more protected. This research presents a comprehensive Systematic Review of the Literature where works related to intrusion detection with ensemble learning were obtained from the most relevant scientific bases. We offer 188 works, several compilations of datasets, classifiers, and ensemble algorithms, and document the experiments that stood out in their performance. A characteristic of this research is its originality. We found two surveys in the literature specifically focusing on the relationship between ensemble techniques and intrusion detection [1] [2]. We present for the last eight years covered by this survey a timeline-based view of the works studied to highlight evolutions and trends. The results obtained by our survey show a growing area, with excellent results in detecting attacks but with needs for improvement in pruning for choosing classifiers, which makes this work unprecedented for this context.
Article
Efficiently detecting network intrusions requires the gathering of sensitive information. This means that one has to collect large amounts of network transactions, including detailed records of recent network transactions. Assessments based on meta-heuristic anomaly are important in the exploratory analysis of intrusion-related network transaction data. These assessments are needed to make and deliver predictions related to the intrusion possibility based on the available attribute details that are involved in the network transaction. We utilized the NSL-KDD data set, covering the binary and multiclass problems with a 20% testing dataset. This paper develops a new hybrid model that can be used to estimate the intrusion scope threshold degree based on the optimal features of the network transaction data that were made available for training. The experimental results revealed that the hybrid approach had a significant effect on the minimisation of the computational and time complexity involved when determining the feature association impact scale. The accuracy of the proposed model was measured as 99.81% and 98.56% for the binary class and multiclass NSL-KDD data sets, respectively. However, there are issues with obtaining high false and low false negative rates. A hybrid approach with two main parts is proposed to address these issues. First, data needs to be filtered using the Vote algorithm with Information Gain that combines the probability distributions of these base learners in order to select the important features that positively affect the accuracy of the proposed model. Next, the hybrid algorithm consists of the following classifiers: J48, Meta Bagging, RandomTree, REPTree, AdaBoostM1, DecisionStump and NaiveBayes. Based on the results obtained using the proposed model, we observe improved accuracy, high false negative rate, and low false positive rate.
Article
Today, the use of social networks is growing ceaselessly and rapidly. More alarming is the fact that these networks have become a substantial pool of unstructured data that belong to a host of domains, including business, governments and health. The increasing reliance on social networks calls for data mining techniques that are likely to facilitate reforming the unstructured data and placing them within a systematic pattern. The goal of the present survey is to analyze the data mining techniques that were utilized by social media networks between 2003 and 2015. Espousing criterion-based research strategies, 66 articles were identified to constitute the source of the present paper. After a careful review of these articles, we found that 19 data mining techniques have been used with social media data to address 9 different research objectives in 6 different industrial and services domains. However, the data mining applications in social media are still raw and require more effort by academia and industry to adequately perform the job. We suggest that more research be conducted by both academia and industry, since the studies done so far are not sufficiently exhaustive of data mining techniques.
Article
Intrusion detection is a promising area of research in the domain of security, given the rapid development of the internet in everyday life. Many intrusion detection systems (IDS) employ a sole classifier algorithm for classifying network traffic as normal or abnormal. Due to the large amount of data, these sole classifier models fail to achieve a high attack detection rate with a reduced false alarm rate. However, by applying dimensionality reduction, data can be efficiently reduced to an optimal set of attributes without loss of information and then classified accurately using a multi-class modeling technique for identifying the different network attacks. In this paper, we propose an intrusion detection model using chi-square feature selection and a multi-class support vector machine (SVM). A parameter tuning technique is adopted to optimize the Radial Basis Function kernel parameter gamma, represented by ‘γ’, and the overfitting constant ‘C’. These are the two important parameters required for the SVM model. The main idea behind this model is to construct a multi-class SVM, which has not been adopted for IDS so far, to decrease the training and testing time and increase the individual classification accuracy of the network attacks. The experimental results on the NSL-KDD dataset, which is an enhanced version of the KDDCup 1999 dataset, show that our proposed approach results in a better detection rate and a reduced false alarm rate. An experiment on the computational time required for training and testing is also carried out for usage in time-critical applications.
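The chi-square feature selection step named in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it only computes the Pearson chi-square statistic between each categorical feature and the class label and ranks features by that score. The function names (`chi_square_score`, `rank_features`) and the toy traffic records are assumptions made for illustration, and the SVM training and parameter tuning are not shown.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between one categorical feature and the labels."""
    n = len(labels)
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    observed = Counter(zip(feature_values, labels))
    score = 0.0
    for f, f_count in feature_counts.items():
        for c, c_count in label_counts.items():
            # Expected count if the feature and the label were independent.
            expected = f_count * c_count / n
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score


def rank_features(rows, labels):
    """Return (feature indices sorted by descending score, list of scores)."""
    n_features = len(rows[0])
    scores = [
        chi_square_score([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy records: (protocol, connection flag, noise feature).
rows = [
    ("tcp", "SF", "a"), ("tcp", "SF", "b"), ("udp", "SF", "a"), ("udp", "SF", "b"),
    ("tcp", "S0", "a"), ("tcp", "S0", "b"), ("tcp", "S0", "a"), ("udp", "S0", "b"),
]
labels = ["normal"] * 4 + ["attack"] * 4

order, scores = rank_features(rows, labels)
print(order, scores)
```

In this toy data the connection flag separates the two classes perfectly, so it receives the highest score, while the noise feature is independent of the label and scores zero. A selection step would keep the top-ranked features before training the classifier.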
Article
Network intrusion detection systems (NIDSs) have been widely deployed in various network environments to defend against different kinds of network attacks. However, a large number of alarms, especially unwanted alarms such as false alarms and non-critical alarms, could be generated during detection, which can greatly decrease the efficiency of the detection and increase the burden of analysis. To address this issue, we advocate that constructing an alarm filter in terms of expert knowledge is a promising solution. In this paper, we develop a method of knowledge-based alert verification and design an intelligent alarm filter based on a multi-class k-nearest-neighbor classifier to filter out unwanted alarms. In particular, the alarm filter employs a rating mechanism by means of expert knowledge to classify incoming alarms to proper clusters for labeling. We further analyze the effect of different classifier settings on classification accuracy with two alarm datasets. In the evaluation, we investigate the performance of the alarm filter with a real dataset and in a network environment, respectively. Experimental results indicate that our alarm filter can effectively filter out a number of NIDS alarms and can achieve a better outcome under the advanced mode.
Article
Intrusion Detection has been heavily studied in both industry and academia, but cybersecurity analysts still desire much more alert accuracy and overall threat analysis in order to secure their systems within cyberspace. Improvements to Intrusion Detection could be achieved by embracing a more comprehensive approach in monitoring security events from many different heterogeneous sources. Correlating security events from heterogeneous sources can grant a more holistic view and greater situational awareness of cyber threats. One problem with this approach is that currently, even a single event source (e.g., network traffic) can experience Big Data challenges when considered alone. Attempts to use more heterogeneous data sources pose an even greater Big Data challenge. Big Data technologies for Intrusion Detection can help solve these Big Heterogeneous Data challenges. In this paper, we review the scope of works considering the problem of heterogeneous data and in particular Big Heterogeneous Data. We discuss the specific issues of Data Fusion, Heterogeneous Intrusion Detection Architectures, and Security Information and Event Management (SIEM) systems, as well as presenting areas where more research opportunities exist. Overall, both cyber threat analysis and cyber intelligence could be enhanced by correlating security events across many diverse heterogeneous sources.
Article
The Internet of Things has broad application in the military field, commerce, environmental monitoring, and many other fields. However, the open nature of the information media and the poor deployment environment have brought great risks to the security of wireless sensor networks, seriously restricting their application. An Internet of Things composed of wireless sensor networks faces security threats mainly from DoS attacks, replay attacks, integrity attacks, false routing information attacks, and flooding attacks. In this paper, we propose a new intrusion detection system based on the K-nearest neighbor (KNN) classification algorithm in wireless sensor networks. This system can separate abnormal nodes from normal nodes by observing their abnormal behaviors, and we analyse the parameter selection and error rate of the intrusion detection system. The paper elaborates on the design and implementation of the detection system. This system achieves efficient, rapid intrusion detection by improving the wireless Ad hoc On-Demand Distance Vector (AODV) routing protocol. Finally, the test results show that the system has high detection accuracy and speed, in accordance with the requirements of wireless sensor network intrusion detection.
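As a rough illustration of the K-nearest neighbor classification idea described in the abstract above, the following sketch labels a node as normal or abnormal by majority vote among its k closest training examples. The feature choice (packet drop rate and route request rate), the sample values, and the function name `knn_predict` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical node features: (packet drop rate, route requests per second).
train_points = [
    (0.02, 1.0), (0.03, 1.2), (0.01, 0.9),    # well-behaved nodes
    (0.60, 9.0), (0.55, 8.5), (0.70, 10.0),   # misbehaving nodes
]
train_labels = ["normal"] * 3 + ["abnormal"] * 3

print(knn_predict(train_points, train_labels, (0.05, 1.1)))
print(knn_predict(train_points, train_labels, (0.65, 9.5)))
```

The value of k trades off noise sensitivity against locality, which is one reason parameter selection matters for this kind of detector. In practice the features would also be scaled so that no single attribute dominates the distance.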
Article
Rapid oral fluid testing (ROFT) devices have been extensively evaluated for their ability to detect common drugs of abuse; however, the performance of such devices on simultaneous screening for ketamine has been scarcely investigated. The present study evaluated three ROFT devices (DrugWipe® 6S, Ora-Check® and SalivaScreen®) on the detection of ketamine, opiates, methamphetamine, cannabis, cocaine and MDMA. A liquid chromatography tandem mass spectrometry (LCMS) assay was first established and validated for confirmation analysis of the six types of drugs and/or their metabolites. In the field test, the three ROFT devices were tested on subjects recruited from substance abuse clinics/rehabilitation centre. Oral fluid was also collected using Quantisal® for confirmation analysis. A total of 549 samples were collected in the study. LCMS analysis on 491 samples revealed the following drugs: codeine (55%), morphine (49%), heroin (40%), methamphetamine (35%), THC (8%), ketamine (4%) and cocaine (2%). No MDMA-positive cases were observed. Results showed that the overall specificity and accuracy were satisfactory and met the DRUID standard of >80% for all 3 devices. Ora-Check® had poor sensitivities (ketamine 36%, methamphetamine 63%, opiates 53%, cocaine 60%, THC 0%). DrugWipe® 6S showed good sensitivities in the methamphetamine (83%) and opiates (93%) tests but performed relatively poorly for ketamine (41%), cocaine (43%) and THC (22%). SalivaScreen® also demonstrated good sensitivities in the methamphetamine (83%) and opiates (100%) tests, and had the highest sensitivity for ketamine (76%) and cocaine (71%); however, it failed to detect any of the 28 THC-positive cases. The test completion rates (proportion of tests completed with quality control passed) were: 52% (Ora-Check®), 78% (SalivaScreen®) and 99% (DrugWipe® 6S).
Article
Optimum-path forest (OPF) is a graph-based machine learning method that can overcome some limitations of the traditional machine learning algorithms that have been used in intrusion detection systems. This paper presents a novel approach for intrusion detection using a modified OPF (MOPF) algorithm for improving the performance of traditional OPF in terms of detection rate (DR), false alarm rate (FAR), and time of execution. To address the problem of scalability in large datasets and also to achieve high attack recognition rates, the proposed framework employs the k-means clustering algorithm, as a partitioning module, for generating different homogeneous training subsets from the original heterogeneous training samples. In the proposed MOPF algorithm, the distance between unlabeled samples and the root (prototype) of every sample in OPF is also considered in classifying unlabeled samples, with the aim of improving the accuracy rate of the traditional OPF algorithm. Moreover, the centrality and prestige concepts from social network analysis are employed in a pruning module for determining the most informative samples in training subsets to speed up the traditional OPF algorithm. The experimental results on the NSL-KDD dataset show that the proposed method performs better than traditional OPF in terms of accuracy rate, DR, FAR, and cost per example (CPE) evaluation metrics.
Article
Redundant and irrelevant features in data have caused a long-term problem in network traffic classification. These features not only slow down the process of classification but also prevent a classifier from making accurate decisions, especially when coping with big data. In this paper, we propose a mutual information based algorithm that analytically selects the optimal features for classification. This mutual information based feature selection algorithm can handle linearly and nonlinearly dependent data features. Its effectiveness is evaluated in the cases of network intrusion detection. An Intrusion Detection System (IDS), named Least Square Support Vector Machine based IDS (LSSVM-IDS), is built using the features selected by our proposed feature selection algorithm. The performance of LSSVM-IDS is evaluated using three intrusion detection evaluation datasets, namely the KDD Cup 99, NSL-KDD and Kyoto 2006+ datasets. The evaluation results show that our feature selection algorithm contributes more critical features for LSSVM-IDS to achieve better accuracy and lower computational cost compared with the state-of-the-art methods.
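The quantity underlying mutual information based feature selection can be sketched as follows. This is only the standard mutual information estimate between one discrete feature and the class label, not the selection algorithm proposed in the abstract above; the function name `mutual_information` and the toy values are assumptions for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired samples of two discrete variables."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


labels = ["normal", "normal", "attack", "attack"]
informative = [0, 0, 1, 1]   # hypothetical feature that tracks the label exactly
irrelevant = [0, 1, 0, 1]    # hypothetical feature independent of the label

print(mutual_information(informative, labels))
print(mutual_information(irrelevant, labels))
```

A feature that determines a balanced binary label carries one bit of information about it, while an independent feature carries none. Because mutual information depends only on the joint distribution, it captures nonlinear as well as linear dependence, which is the property the abstract highlights.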
Article
This paper presents a new feature-selection approach based on the cuttlefish optimization algorithm which is used for intrusion detection systems (IDSs). Because IDSs deal with a large amount of data, one of the crucial tasks of IDSs is to keep the best quality of features that represent the whole data and remove the redundant and irrelevant features. The proposed model uses the cuttlefish algorithm (CFA) as a search strategy to ascertain the optimal subset of features and the decision tree (DT) classifier as a judgement on the selected features that are produced by the CFA. The KDD Cup 99 dataset is used to evaluate the proposed model. The results show that the feature subset obtained by using CFA gives a higher detection rate and accuracy rate with a lower false alarm rate, when compared with the obtained results using all features.