
Clustering Enabled Classification using Ensemble Feature Selection for Intrusion Detection

April 2019


DOI:10.1109/ICCNC.2019.8685636

Conference: International Conference on Computing, Networking and Communications (ICNC 2019)
At: Hawaii, USA

Authors:

Fadi Salo

The University of Western Ontario

Mohammadnoor Ahmad Mohammad Injadat

Zarqa University

Abdallah Moubayed

Lebanese American University

Ali Bou Nassif

University of Sharjah


Machine learning has been leveraged to increase the effectiveness of intrusion detection systems (IDSs). The focus of this approach, however, has largely been on detecting known attack patterns based on outdated datasets. In this paper, we propose an ensemble feature selection method along with an anomaly detection method that combines unsupervised and supervised machine learning techniques to classify network traffic and identify previously unseen attack patterns. To that end, three different feature selection techniques are used as part of an ensemble model that selects 8 common features. Moreover, k-means clustering is used to first partition the training instances into k clusters using the Manhattan distance. A classification model is then built based on the resulting clusters, which represent a density region of normal or anomaly instances. This in turn helps determine the effectiveness of the clustering in detecting unknown attack patterns within the data. The performance of our classifier is evaluated using the Kyoto dataset, which was collected between 2006 and 2015. To our knowledge, no previous work has proposed such a framework that combines unsupervised and supervised machine learning approaches using this dataset. Experimental results show the effectiveness of the proposed framework in detecting previously unseen attack patterns compared to the traditional classification approach.


Clustering Enabled Classification using Ensemble Feature Selection for Intrusion Detection

Fadi Salo∗, MohammadNoor Injadat∗, Abdallah Moubayed∗, Ali Bou Nassif†, Aleksander Essex∗

∗Department of Electrical and Computer Engineering, The University of Western Ontario, London, ON, Canada

Email: {fsalo, minjadat, amoubaye, aessex}@uwo.ca

†Department of Electrical and Computer Engineering, University of Sharjah, Sharjah, UAE

Email: anassif@sharjah.ac.ae


Index Terms—Network anomaly detection, Ensemble feature

selection, k-means clustering, Classification, Kyoto dataset.

I. INTRODUCTION

Two performance indicators often used to evaluate the effectiveness of intrusion detection systems (IDSs) [1] are precision and stability [2], [3]. Past research has focused on rule-based systems and statistical approaches [4]; however, the performance of such approaches suffers on larger datasets. Data mining and machine learning approaches have been proposed as one promising solution to this problem [5].

IDSs are typically classified as either anomaly-based or misuse-based, with each class having merits and limitations. Misuse-based methods compare the data against a predefined set of rules or patterns to detect network attacks. However, such approaches are not adaptable and are limited in their ability to detect previously unseen attack types. On the other hand, anomaly-based methods collect data that represent normal behavior and build familiarity models; actions deviating from the model are labeled as suspicious/anomalous [6].

Despite the promising improvements achieved by previous work, intrusion detection is still a challenging problem. This is exacerbated by the high volume of traffic data, a continuously evolving environment, and an abundance of available features [7]. For example, a set of irrelevant, redundant, or highly correlated features can be found in high dimensional datasets, which can negatively impact the accuracy and performance of IDSs. Choosing an appropriate subset of features is essential to improving the detection model [8].

In this paper, we propose an effective intrusion detection framework based on ensemble feature selection, clustering, and supervised machine learning classifiers that can detect previously unseen attack patterns with high accuracy. Toward the development of an accurate and robust anomaly detection methodology, we used several graphical and statistical exploratory data analytics techniques on our chosen dataset. The result of our analysis led us to select support vector machines with Gaussian kernels (SVM-RBF), k-nearest neighbors (k-NN), random forests (RF), and quadratic discriminant analysis (QDA) due to the non-linear nature of the considered dataset.

The performance of our approach is evaluated and compared to the traditional classification approach by conducting different experiments with the Kyoto 2006+ dataset, which was built during a 9-year period of network traffic collection from honeypots in Kyoto University [9]. To our knowledge, no previous work has proposed such a framework that combines unsupervised and supervised machine learning approaches using this dataset.

This paper combines the use of homogeneous ensemble feature selection using three feature selection algorithms (correlation-based, information gain-based, and significance-based) with k-means clustering and different classification techniques to improve the IDS's ability to detect and identify unknown patterns. The feasibility and efficiency of the considered methods are evaluated using various metrics such as accuracy, precision, recall, F-measure, and false alarm rate.

The main contributions of this paper include:

• An investigation of the behavior and characteristics of the Kyoto dataset using a graphical and statistical exploratory data analytics approach.

• A proposal for a homogeneous ensemble feature selection approach using three base feature selection algorithms.

• The first study, to our knowledge, of the efficiency and effectiveness of a clustering-based classification framework for anomaly detection using the Kyoto 2006+ dataset.

The remainder of this paper is organized as follows. Section

II summarizes some of the related work. Section III gives a

brief overview of considered algorithms. Section IV presents

the proposed framework. Section V discusses the research

methodology and the experimental results. Finally, Section VI

concludes the paper and provides future research directions.

II. RELATED WORK

Researchers have typically treated intrusion detection as a classification problem. To that end, various supervised machine learning classifiers such as support vector machines (SVM) [10], decision trees [11], k-nearest neighbor (k-NN) [12], and naive Bayes have been proposed; see, e.g., the review by Tsai et al. [13]. Further promising results were obtained by proposing novel data mining-based approaches (cf. Wu and Banzhaf [14]). Recently, hybrid optimization-based models have been proposed to improve the performance of intrusion detection systems. For instance, Chung and Wahid [15] proposed a hybrid approach that includes feature selection and classification with simplified swarm optimization (SSO). The performance of SSO was further improved by using weighted local search (WLS) to obtain better solutions from the neighborhood [15]. The authors reported intrusion detection accuracy up to 93.3%. Similarly, Kuang et al. [16] proposed a hybrid method that combined a genetic algorithm (GA) and multi-layered SVM with kernel principal component analysis (KPCA) to enhance the detection performance. Another technique by Zhang et al. [17] combined misuse and anomaly detection using random forests. In contrast, a novel particle swarm optimization-based algorithm, Catfish-BPSO, was proposed in [18] to select features and enhance the model performance.

III. THEORETIC ASPECTS OF THE TECHNIQUES

Starting with the clustering, the k-means algorithm was chosen to group the instances into two clusters. The algorithm was chosen specifically due to its simplicity, its effectiveness in dealing with network traffic data [19], and its flexibility in offering the choice of the desired number of clusters. In brief, k-means is an unsupervised machine learning algorithm that groups instances into k clusters using a particular distance metric such as the Euclidean, Manhattan, or Mahalanobis distance [20]. This distance metric is used to determine the proximity of each instance to the cluster centroid. Upon convergence, the output of the algorithm is the centroid of each of the k clusters and the cluster label of each instance.
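The assignment and update loop described above can be sketched as follows. This is a minimal illustration in Python, not the MATLAB implementation used for the experiments. It assumes the component-wise median as the centroid update (the usual pairing with the Manhattan distance), random initialisation from the data, and the hypothetical names `manhattan` and `kmeans_manhattan`; none of these details are stated in the paper.

```python
import random
from statistics import median


def manhattan(a, b):
    """L1 (city-block) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans_manhattan(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: manhattan(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: the component-wise median minimises the summed L1 distance.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [median(col) for col in zip(*members)]
    return centroids, labels


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans_manhattan(pts, k=2)
```

On this toy input the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialisation.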

On the other hand, four different supervised machine learning classification algorithms are considered in this work: support vector machines (SVM), k-nearest neighbors (k-NN), random forests (RF), and quadratic discriminant analysis (QDA). These algorithms are chosen due to their ability to deal with non-linear datasets. We briefly recall some specifics.

SVM is a supervised machine learning classification algorithm that tries to determine the maximum-margin separating hyperplane between two classes in order to distinguish the positive class from the negative class [21]. The output of the SVM with Gaussian kernel (SVM-RBF) is [22]:

$$f(x) = w^{T}\Phi(x) + b \qquad (1)$$

where Φ(x) represents the kernel used. The goal is to determine the weight vector w and intercept b that minimize the following objective function:

$$\min_{w,b}\; \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\Big[\, y_i \cdot \mathrm{cost}_1\big(f(x_i)\big) + (1 - y_i)\cdot \mathrm{cost}_0\big(f(x_i)\big) \Big] \qquad (2)$$

where C is a regularization parameter that penalizes incorrectly classified instances, N is the number of training instances, and cost_i is the squared error over the training dataset.
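To make the role of the Gaussian kernel concrete, the sketch below evaluates a decision function in the standard kernelised (dual) form, f(x) = Σ c_i K(s_i, x) + b, where the s_i are support vectors. The dual form, the parameter `gamma`, and all numeric values are assumptions introduced for illustration; the paper gives only the primal expression in (1) and does not report its kernel parameters.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_function(x, support_vectors, dual_coefs, b, gamma):
    """f(x) = sum_i coef_i * K(sv_i, x) + b, with coef_i = alpha_i * y_i."""
    return sum(
        c * rbf_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, dual_coefs)
    ) + b


def predict(x, support_vectors, dual_coefs, b, gamma):
    """Positive class (+1) when f(x) >= 0, negative class (-1) otherwise."""
    score = decision_function(x, support_vectors, dual_coefs, b, gamma)
    return 1 if score >= 0 else -1


# Hypothetical model: one support vector per class.
svs = [(0.0, 0.0), (2.0, 2.0)]
coefs = [1.0, -1.0]
near_first = predict((0.1, 0.1), svs, coefs, b=0.0, gamma=0.5)
near_second = predict((1.9, 1.9), svs, coefs, b=0.0, gamma=0.5)
```

A point close to the first support vector is scored positive and a point close to the second is scored negative, because the kernel value decays with squared distance.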

k-NN is a simple classification algorithm that determines the class of an instance based on the majority class of its k nearest neighboring points. This is done by first evaluating the distance from the data point to all other points within the training dataset. Different distance measures can be used, such as the Euclidean, Manhattan, or Mahalanobis distance [20]. After determining the distances, the k nearest points are identified and a majority voting-based decision is made on the class of the considered data point [23].
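The two steps just described (rank by distance, then vote) fit in a few lines. The sketch below is illustrative only: it assumes the Euclidean distance via `math.dist` (Python 3.8+), and the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, train_points, train_labels, k=3, distance=math.dist):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank every training point by its distance to the query.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(query, pair[0]),
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]
label_a = knn_predict((0.5, 0.5), train_x, train_y, k=3)
label_b = knn_predict((5.5, 5.5), train_x, train_y, k=3)
```

Passing a different callable as `distance` (for example a Manhattan distance) changes the neighbourhood without touching the voting logic.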

Random forests are an ensemble learning classifier that combines several decision trees to predict the class [24]. Each tree is built independently from a random sample, and the trees' results are combined using a majority vote. The RF classifier sends any new incoming data point to each of its trees and chooses the class that is predicted by the most trees.
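The voting step can be sketched independently of how the trees are grown. In the sketch below each "tree" is simply a hypothetical callable that maps a data point to a class label; the thresholds are made up and stand in for trained decision trees.

```python
from collections import Counter


def forest_predict(trees, x):
    """Send x to every tree and return the class predicted by the most trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees (one-rule classifiers).
trees = [
    lambda x: "attack" if x[0] > 1 else "normal",
    lambda x: "attack" if x[1] > 5 else "normal",
    lambda x: "attack" if x[0] + x[1] > 3 else "normal",
]
vote_low = forest_predict(trees, (2, 0))   # one tree says attack, two say normal
vote_high = forest_predict(trees, (2, 6))  # all three trees say attack
```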

Finally, discriminant analysis is a statistical technique that tries to find a group of prediction equations based on independent variables [25]. This technique can be used for one of two objectives, either to determine a prediction equation that can be used to classify new input points, or to interpret the equation to get a better understanding of the relationship that exists between the variables [25]. A quadratic kernel is one which assumes a quadratic relationship exists between the independent variables.

IV. PROPOSED FRAMEWORK

This section presents an overview of the clustering-enabled classification framework for intrusion detection. The goal is to evaluate the efficiency of the framework in detecting unseen/unknown patterns. This is done in two ways: first, by comparing the result of the clustering to the true label of each instance, and second, by using a separate data sample as the testing dataset of unseen data instances. The framework is as follows. The first step consists of selecting appropriate features. To that end, this work proposes an ensemble feature selection technique that combines three different feature selection methods, namely information gain, correlation, and significance methods. The features chosen are the features common among the three methods. Following feature selection, graphical and statistical data analytics is applied to get a better understanding of the behavior of the selected features. Then, the training dataset is clustered using the k-means algorithm, which we chose due to its simplicity and effectiveness in dealing with network traffic data [19]. This is followed by building the final classification model using the selected classification algorithms. The choice of these algorithms is dependent on the insights gained from the exploratory data analytics step. The model is then applied to the testing dataset to evaluate the efficiency of the clustering in detecting previously unseen patterns. The overall flow of the proposed framework is illustrated in Fig. 1.
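One way to carry out the first check, comparing cluster assignments with the true labels, is to try every mapping of cluster indices onto class names and keep the best agreement. The paper does not say how clusters are matched to classes, so the matching rule, the function name `cluster_agreement`, and the toy data below are assumptions; the sketch also assumes there are no more clusters than classes.

```python
from itertools import permutations


def cluster_agreement(cluster_ids, true_labels):
    """Best fraction of instances whose cluster maps onto their true label."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(true_labels))
    best = 0.0
    # Try every one-to-one mapping of cluster ids onto class names.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        best = max(best, hits / len(true_labels))
    return best


ids = [0, 0, 0, 1, 1, 1]
truth = ["attack", "attack", "normal", "normal", "normal", "normal"]
agreement = cluster_agreement(ids, truth)  # 5 of 6 instances agree
```

With two clusters and two classes only two mappings need to be checked, so the exhaustive search is inexpensive.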

Fig. 1. Proposed framework (blocks: Datasets (Training, Testing); Feature Selection (Information Gain, Correlation, Significance) giving Selected Features; Building Model (Clustering with k-means, Classification with SVM/KNN/RF/QDA); Final Model; Testing Model; Result (Normal or Attack))

V. EXPERIMENTAL SETUP AND RESULT DISCUSSION

A. Dataset Description

We used the Kyoto 2006+ dataset to evaluate our proposed

framework. This dataset was collected from honeypots by the

University of Kyoto over the 9-year period from Nov. 2006

to Dec. 2015. It consists of 1 million records, each containing

24 features [9]. A random subset of approximately 300,000

records was chosen to form our experimental dataset. This

was further divided into training and testing datasets, using a

60/40 split. The training dataset consisted of 178,479 records

with 92,729 normal and 85,750 attack records. The testing

dataset consisted of 118,986 records with 61,765 normal and

57,221 attack records.

B. Experimental setup and Data Preprocessing

The proposed techniques were implemented using MATLAB 2018a. The selected dataset was transformed from its original format into a new dataset consisting of 8 features, as illustrated by the Venn diagram shown in Fig. 2.

Fig. 2. Ensemble Feature Selection Output (Venn diagram: Correlation (11), Information Gain (11), Significance (20))
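Keeping only the features common to all three selectors amounts to a set intersection. The sketch below illustrates that step. The function name `ensemble_select` and the feature names are hypothetical placeholders, not the actual outputs of the correlation, information gain, or significance selectors.

```python
def ensemble_select(*selections):
    """Return the features chosen by every base selector.

    The result keeps the order in which features appear in the first selection.
    """
    common = set(selections[0]).intersection(*selections[1:])
    return [feature for feature in selections[0] if feature in common]


# Hypothetical outputs of three base selectors.
by_correlation = ["f1", "f2", "f3", "f5"]
by_information_gain = ["f3", "f1", "f4", "f5"]
by_significance = ["f1", "f3", "f6", "f7"]
selected = ensemble_select(by_correlation, by_information_gain, by_significance)
```

Here only `f1` and `f3` appear in all three lists, so they form the ensemble output.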

Outlier removal was performed using the Inter-Quartile Range (IQR) to remove any redundant or noisy data points. As most of the classifiers do not accept categorical features [26], data mapping was used to transform non-numeric feature values into numeric ones (named categorical in MATLAB).
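Both preprocessing steps can be sketched with the Python standard library. The sketch assumes the common 1.5 × IQR fences (the paper does not state the multiplier), quartiles as computed by `statistics.quantiles`, and integer codes assigned in order of first appearance. The function names and the `duration` and protocol values are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, factor=1.5):
    """Lower and upper fences: Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def remove_outliers(rows, column, factor=1.5):
    """Drop rows whose value in `column` falls outside the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows], factor)
    return [row for row in rows if low <= row[column] <= high]


def encode_categories(values):
    """Map each distinct non-numeric value to an integer code."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # next unused code
        codes.append(mapping[value])
    return codes, mapping


rows = [{"duration": d} for d in [1, 2, 3, 4, 5, 100]]
kept = remove_outliers(rows, "duration")  # the row with 100 is dropped
codes, mapping = encode_categories(["tcp", "udp", "tcp", "icmp"])
```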

C. Prediction Performance Measures

To evaluate and compare prediction models quantitatively,

the following measurements were utilized:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (5)$$

$$\text{F-measure} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (6)$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively [27].
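The measures in (3) to (6) can be computed from the four confusion-matrix counts as sketched below. The false alarm rate (FAR) reported in the tables is not defined in the text; the sketch assumes the usual definition FAR = FP / (FP + TN). The function name and the counts are illustrative, and all denominators are assumed to be non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, F-measure and FAR from counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    far = fp / (fp + tn)  # assumption: FAR is the false positive rate
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "far": far,
    }


# Hypothetical confusion-matrix counts.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

For these counts the accuracy is 0.85, the recall is 0.9, and the assumed FAR is 0.2.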

Fig. 3. Probability Density Function of Normal and Attack Traffic

Fig. 4. Number of Same Service to Same Destination ID vs Same Service Rate

D. Results Discussion

The goal of this work is to evaluate the effectiveness of the clustering method in detecting unseen/unknown patterns within the dataset. However, a crucial initial step is to study the behavior of the dataset to gain insights into it. To that end, exploratory data analytics is applied by plotting the probability density function of two features, namely the same srv rate (same service rate) and the Dst host srv count (number of connections of the same service type to the same destination IP), for both normal and attack traffic, as shown in Fig. 3. It can be seen that normal traffic data tends to have a higher service rate and a higher number of same-service connections to the same destination IP. The service rate refers to the percentage of the connections that have the same service type (e.g., http, telnet, etc.). On the other hand, attack traffic data has the exact opposite trend. These statistical trends give us initial insights into the behavior of normal and attack traffic that can be helpful in future prediction.

Fig. 4 plots the two highest ranked numeric features against each other, namely the number of same-service connections to the same destination against the same service rate. It is clear that the dataset is not linearly separable. This provides further insights and justifies the choice of the SVM-RBF, k-NN, RF, and QDA methods, as they can handle non-linear data.

Table I shows the performance of the considered classification algorithms for both the training and testing datasets without clustering. These results will be used as a benchmark to gauge the effectiveness of the clustering algorithm in detecting previously unseen patterns.

TABLE I
PERFORMANCE RESULTS OF THE THREE CLASSIFIERS WITHOUT CLUSTERING

              Training                                      Testing
Classifier    Acc(%)  Precision  Recall  FAR    F-measure   Acc(%)  Precision  Recall  FAR    F-measure
SVM-RBF       96.88   0.982      0.958   0.018  0.97        80.03   0.767      0.883   0.288  0.821
k-NN (k=3)    98.03   0.985      0.977   0.016  0.981       86.95   0.882      0.864   0.124  0.873
k-NN (k=5)    97.65   0.982      0.972   0.018  0.977       88.33   0.898      0.875   0.107  0.886
RF            98.52   0.994      0.978   0.006  0.986       57.79   0.911      0.207   0.021  0.337
QDA           87.1    0.926      0.817   0.071  0.868       87.02   0.936      0.805   0.059  0.866

TABLE II
PERFORMANCE RESULTS OF THE THREE CLASSIFIERS WITH CLUSTERING

              Training                                      Testing
Classifier    Acc(%)  Precision  Recall  FAR    F-measure   Acc(%)  Precision  Recall  FAR    F-measure
SVM-RBF       98.24   0.988      0.984   0.02   0.986       76.89   0.716      0.92    0.393  0.805
k-NN (k=3)    98.96   0.991      0.991   0.013  0.992       82.28   0.783      0.911   0.272  0.842
k-NN (k=5)    98.56   0.989      0.988   0.018  0.988       81.33   0.768      0.918   0.299  0.836
RF            99.98   0.999      0.999   0.001  0.999       79.06   0.741      0.916   0.344  0.82
QDA           93.49   0.973      0.92    0.041  0.946       81.63   0.784      0.892   0.265  0.834

Table II, on the other hand, shows the performance of the different classification algorithms with clustering. Several observations can be made. The first is that almost all algorithms (with the exception of QDA) achieved good training accuracy, as shown in Table I. However, this did not necessarily translate into testing accuracy. This shows that algorithms such as RF are not well suited to anomaly detection, as RF only had a testing accuracy of approximately 58%. The second observation is that using clustering to detect anomalies is effective. This is based on the fact that the testing accuracy of the classifiers after clustering is close to that of the non-clustering case. This shows that clustering is able to detect previously unseen patterns effectively. This is further highlighted in Figs. 5 and 6, which show the training and testing accuracy of the different classification algorithms with and without clustering. The results in Fig. 5 are expected, since the model was trained using this dataset and hence will have a high accuracy. However, the results in Fig. 6 emphasize the efficiency of the proposed clustering-enabled classification. It can be seen that the difference in testing accuracy between the clustering and the non-clustering cases is less than 10% for most classification algorithms. This means that the clustering was able to predict previously unseen patterns with a relatively high accuracy. By comparison, the results reported in [28] only showed an accuracy between 60% and 77%. In contrast, the proposed work was able to achieve higher testing accuracy, with the lowest being approximately 77%, thus illustrating the effectiveness of the proposed framework. Moreover, it can be concluded that k-NN with 3 neighbors has the best performance given that it achieved high training accuracy and had one of the smallest differences in testing accuracy between the non-clustering and the clustering cases (4.67 percentage points, against 3.14 for SVM-RBF).
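The testing-accuracy gaps discussed above can be recomputed directly from the values in Tables I and II. The short sketch below does so; the dictionary layout and variable names are illustrative only.

```python
# Testing accuracy (%) per classifier: (without clustering, with clustering),
# copied from Tables I and II.
testing_accuracy = {
    "SVM-RBF": (80.03, 76.89),
    "k-NN (k=3)": (86.95, 82.28),
    "k-NN (k=5)": (88.33, 81.33),
    "RF": (57.79, 79.06),
    "QDA": (87.02, 81.63),
}

# Gap in percentage points; a negative gap means clustering scored higher.
gaps = {
    name: round(without - with_clustering, 2)
    for name, (without, with_clustering) in testing_accuracy.items()
}

# Classifiers whose absolute gap stays under 10 percentage points.
within_ten = [name for name, gap in gaps.items() if abs(gap) < 10]
```

With the tabulated values the gaps are 3.14 (SVM-RBF), 4.67 (k-NN, k=3), 7.00 (k-NN, k=5), and 5.39 (QDA) percentage points, while RF scores 21.27 points higher with clustering than without. Four of the five configurations therefore stay within the 10-point range.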

Fig. 5. Overall Accuracy of Training Dataset (bar chart of accuracy for SVM-RBF, k-NN (k=3), k-NN (k=5), RF, and QDA, without and with clustering)

Fig. 6. Overall Accuracy of Testing Dataset (bar chart of accuracy for SVM-RBF, k-NN (k=3), k-NN (k=5), RF, and QDA, without and with clustering)

VI. CONCLUSIONS

In this paper, an efficient intrusion detection framework based on homogeneous ensemble feature selection, clustering, and supervised machine learning classifiers was proposed. This was done in order to test the efficiency of the clustering algorithm in detecting previously unseen attack patterns for intrusion detection. The techniques considered were chosen based on the nature of the selected dataset, which was investigated using different graphical and statistical exploratory data analytics techniques such as the probability density function. The performance was evaluated and compared by conducting different experiments with the Kyoto 2006+ dataset, which was built over 9 years of real traffic data collection (between Nov. 2006 and Dec. 2015) from diverse types of honeypots in Kyoto University. To the best of our knowledge, no previous work has proposed such a framework using this dataset. To explore the dataset, a homogeneous ensemble feature selection mechanism using three feature selection algorithms (correlation-based, information gain-based, and significance-based) was applied to extract 8 representative features. This was followed by applying different graphical and statistical exploratory data analytics techniques to better understand the behavior of the features. The results of this data analysis showed that the dataset is highly non-linear, which motivated the choice of the considered supervised classification algorithms. The new dataset was then clustered using the k-means algorithm, and a classification model was built using different classification techniques to improve the IDS's ability to detect and identify unknown patterns. Experimental results showed that k-means clustering was indeed efficient in detecting previously unseen patterns. This was highlighted by the small difference in testing accuracy between the clustering and the non-clustering cases, which did not exceed 10% for most classifiers. Furthermore, it was also shown that the k-NN algorithm with k = 3 had the best performance, as it achieved high training accuracy and one of the smallest testing accuracy differences. In order to further improve the performance of the proposed approach, we plan to develop an adaptive model that clusters any new attacks with existing ones. This in turn will provide a more robust and dynamic intrusion detection system and improve its security.

REFERENCES

[1] R. Zuech, T. M. Khoshgoftaar, and R. Wald, “Intrusion detection and big heterogeneous data: a survey,” Journal of Big Data, vol. 2, no. 1, p. 3, Feb 2015. [Online]. Available: https://doi.org/10.1186/s40537-015-0013-4

[2] L. de Sá Silva, A. C. F. dos Santos, T. D. Mancilha, J. D. S. da Silva, and A. Montes, “Detecting attack signatures in the real network traffic with annida.” Elsevier, 2008, vol. 34, no. 4, pp. 2326–2333.

[3] A. Patcha and J.-M. Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Computer networks, vol. 51, no. 12, pp. 3448–3470, 2007.

[4] S. Mukkamala, G. Janoski, and A. Sung, “Intrusion detection using neural networks and support vector machines,” vol. 2, pp. 1702–1707, 2002.

[5] S.-Y. Wu and E. Yen, “Data mining-based intrusion detectors,” Expert Systems with Applications, vol. 36, no. 3, pp. 5605–5612, 2009.

[6] C. Kruegel, F. Valeur, and G. Vigna, “Intrusion detection and correlation: challenges and solutions,” vol. 14, 2004.

[7] S. Suthaharan, “Big data classification: Problems and challenges in network intrusion prediction with machine learning,” ACM SIGMETRICS Performance Evaluation Review, vol. 41, no. 4, pp. 70–73, 2014.

[8] J. Zhang and M. Zulkernine, “Anomaly based network intrusion detection with unsupervised outlier detection,” in Communications, 2006. ICC’06. IEEE International Conference on, vol. 5. IEEE, 2006, pp. 2388–2393.

[9] M. A. Ambusaidi, X. He, P. Nanda, and Z. Tan, “Building an intrusion detection system using a filter-based feature selection algorithm,” IEEE Transactions on Computers, vol. 65, no. 10, pp. 2986–2998, Oct 2016.

[10] A. S. Eesa, Z. Orman, and A. M. A. Brifcani, “A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems,” Expert Systems with Applications, vol. 42, no. 5, pp. 2670–2679, 2015.

[11] W. Li, P. Yi, Y. Wu, L. Pan, and J. Li, “A new intrusion detection system based on knn classification algorithm in wireless sensor network,” Journal of Electrical and Computer Engineering, vol. 2014, 2014.

[12] S. Aljawarneh, M. Aldwairi, and M. B. Yassein, “Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model,” Journal of Computational Science, 2017.

[13] C.-F. Tsai, Y.-F. Hsu, C.-Y. Lin, and W.-Y. Lin, “Intrusion detection by machine learning: A review,” Expert Systems with Applications, vol. 36, no. 10, pp. 11 994–12 000, 2009.

[14] S. X. Wu and W. Banzhaf, “The use of computational intelligence in intrusion detection systems: A review,” Applied soft computing, vol. 10, no. 1, pp. 1–35, 2010.

[15] Y. Y. Chung and N. Wahid, “A hybrid network intrusion detection system using simplified swarm optimization (sso),” Applied Soft Computing, vol. 12, no. 9, pp. 3014–3022, 2012.

[16] F. Kuang, W. Xu, and S. Zhang, “A novel hybrid kpca and svm with ga model for intrusion detection,” Applied Soft Computing, vol. 18, pp. 178–184, 2014.

[17] J. Zhang, M. Zulkernine, and A. Haque, “Random-forests-based network intrusion detection systems,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 5, pp. 649–659, 2008.

[18] A. J. Malik and F. A. Khan, “A hybrid technique using multi-objective particle swarm optimization and random forests for probe attacks detection in a network,” in Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on. IEEE, 2013, pp. 2473–2478.

[19] Y. Liu, W. Li, and Y.-C. Li, “Network traffic classification using k-means clustering,” in Computer and Computational Sciences, 2007. IMSCCS 2007. Second International Multi-Symposiums on. IEEE, 2007, pp. 360–365.

[20] A. Kind, M. P. Stoecklin, and X. Dimitropoulos, “Histogram-based traffic anomaly detection,” IEEE Transactions on Network and Service Management, vol. 6, no. 2, pp. 110–121, 2009.

[21] I. S. Thaseen and C. A. Kumar, “Intrusion detection model using fusion of chi-square feature selection and multi class svm,” Journal of King Saud University-Computer and Information Sciences, vol. 29, no. 4, pp. 462–472, 2017.

[22] H. Bostani and M. Sheikhan, “Modification of supervised opf-based intrusion detection systems using unsupervised learning and social network concept,” Pattern Recognition, vol. 62, pp. 56–72, 2017.

[23] W. Meng, W. Li, and L.-F. Kwok, “Design of intelligent knn-based alarm filter using knowledge-based alert verification in intrusion detection,” Security and Communication Networks, vol. 8, no. 18, pp. 3883–3895, 2015.

[24] M. Injadat, F. Salo, and A. B. Nassif, “Data mining techniques in social media: A survey,” Neurocomputing, vol. 214, pp. 654–670, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S092523121630683X

[25] N. S. Software, “Chapter 440: discriminant analysis,” Available at: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Discriminant Analysis.pdf.

[26] M. Salem and U. Buehler, “Mining techniques in network security to enhance intrusion detection systems,” arXiv preprint arXiv:1212.2414, 2012.

[27] M. H. Tang, C. Ching, S. Poon, S. S. Chan, W. Ng, M. Lam, C. Wong, R. Pao, A. Lau, and T. W. Mak, “Evaluation of three rapid oral fluid test devices on the screening of multiple drugs of abuse including ketamine,” Forensic science international, 2018.

[28] F. Hosseinpour, P. V. Amoli, F. Farahnakian, J. Plosila, and T. Hämäläinen, “Artificial immune system based intrusion detection: innate immunity using an unsupervised learning approach,” International Journal of Digital Content Technology and its Applications, vol. 8, no. 5, p. 1, 2014.

OPEN ACCESS EDITED BY

Article

Full-text available

Jun 2024

The advancement of communication and internet technology has brought risks to network security. Thus, Intrusion Detection Systems (IDS) was developed to combat malicious network attacks. However, IDSs still struggle with accuracy, false alarms, and detecting new intrusions. Therefore, organizations are using Machine Learning (ML) and Deep Learning (DL) algorithms in IDS for more accurate attack detection. This paper provides an overview of IDS, including its classes and methods, the detected attacks as well as the dataset, metrics, and performance indicators used. A thorough examination of recent publications on IDS-based solutions is conducted, evaluating their strengths and weaknesses, as well as a discussion of their potential implications, research challenges, and new trends. We believe that this comprehensive review paper covers the most recent advances and developments in ML and DL-based IDS, and also facilitates future research into the potential of emerging Artificial Intelligence (AI) to address the growing complexity of cybersecurity challenges

A Comprehensive Survey: Evaluating the Efficiency of Artificial Intelligence and Machine Learning Techniques on Cyber Security Solutions

Article

Full-text available

Jan 2024

Given the continually rising frequency of cyberattacks, the adoption of artificial intelligence methods, particularly Machine Learning (ML), Deep Learning (DL), and Reinforcement Learning (RL), has become essential in the realm of cybersecurity. These techniques have proven to be effective in detecting and mitigating cyberattacks, which can cause significant harm to individuals, organizations, and even countries. Machine learning algorithms use statistical methods to identify patterns and anomalies in large datasets, enabling security analysts to detect previously unknown threats. Deep learning, a subfield of ML, has shown great potential in improving the accuracy and efficiency of cybersecurity systems, particularly in image and speech recognition. On the other hand, RL is again a subfield of machine learning that trains algorithms to learn through trial and error, making it particularly effective in dynamic environments. We also evaluated the usage of ChatGPT-like AI tools in cyber-related problem domains on both sides, positive and negative. This article provides an overview of how ML, DL, and RL are applied in cybersecurity, including their usage in malware detection, intrusion detection, vulnerability assessment, and other areas. The state-of-the-art studies using ML, DL, and RL models are evaluated in each section based on the main idea, techniques, and important findings. It also discusses these techniques’ challenges and limitations, including data quality, interpretability, and adversarial attacks. Overall, the use of ML, DL, and RL in cybersecurity holds great promise for improving the effectiveness of security systems and enhancing our ability to protect against cyberattacks. However, it is essential to continue developing and refining these techniques to address the ever-evolving nature of cyber threats. 
Besides, some promising solutions that rely on machine learning, deep learning, and reinforcement learning are susceptible to adversarial attacks, underscoring the importance of factoring in this vulnerability when devising countermeasures against sophisticated cyber threats. We also concluded that ChatGPT can be a valuable tool for cybersecurity, but it should be noted that ChatGPT-like tools can also be manipulated to threaten the integrity, confidentiality, and availability of data.

Machine Learning Models for Detecting Anomalies in Online Payment: A Comparative Analysis

Conference Paper

Full-text available

Sep 2023

With the growing use of online payment systems, the need for strong security measures to defend against fraudulent activity has become critical. Anomaly detection approaches based on machine learning algorithms have emerged as efficient solutions for spotting aberrant patterns and detecting fraudulent transactions in online payment systems. They offer efficient and effective online payment monitoring, protecting against fraudulent activity. In the present study, the application of machine learning techniques for anomaly detection in online payment systems is investigated. In conclusion, the results provided by four models, namely Logistic Regression, Decision Tree, Random Forest and Extreme Gradient Boosting (XGB) Classifier, show that they can be preferred for anomaly detection in online payment. Among the four models, the XGB Classifier provided the highest model accuracy.

Optimized Ensemble Model Towards Secured Industrial IoT Devices

Conference Paper

Dec 2023

Mohammadnoor Ahmad Mohammad Injadat

Healthcare System for Individual Prakriti

Article

Feb 2024

Prof. Girisha Bombale

In the dynamic landscape of healthcare, personalized and holistic patient care is becoming increasingly vital. The "HEALTHCARE SYSTEM FOR INDIVIDUAL PRAKRITI" project offers an innovative approach to understanding patients' individual constitution or "Prakriti" based on Ayurvedic principles. This project aims to enhance healthcare outcomes by integrating traditional Ayurvedic knowledge with modern technology. The "HEALTHCARE SYSTEM FOR INDIVIDUAL PRAKRITI" project represents a harmonious blend of traditional wisdom and contemporary technology, with the goal of advancing the quality of healthcare and promoting a holistic understanding of patient health. This abstract provides an overview of the project's purpose, combining Ayurvedic principles with modern healthcare practices to benefit both patients and healthcare providers.

Machine learning for sports betting: Should model selection be based on accuracy or calibration?

Article

Feb 2024

Interactive effects of hyperparameter optimization techniques and data characteristics on the performance of machine learning algorithms for building energy metamodeling

Article

Feb 2024

Securing IoT-Edge Networks: Federated Deep Learning for Botnet Detection

Conference Paper

Nov 2023

Enhancing Intrusion Detection Systems Through Dimensionality Reduction: A Comparative Study of Machine Learning Techniques for Cyber Security

Article

Jan 2024

A Comprehensive Survey on Ensemble Learning-Based Intrusion Detection Approaches in Computer Networks

Article

Full-text available

Jan 2023

Machine learning algorithms present a robust alternative for building Intrusion Detection Systems due to their ability to recognize attacks in computer network traffic by recognizing patterns in large amounts of data. Typically, classifiers are trained for this task. Together, ensemble learning algorithms have increased the performance of these detectors, reducing classification errors and allowing computer networks to be more protected. This research presents a comprehensive Systematic Review of the Literature where works related to intrusion detection with ensemble learning were obtained from the most relevant scientific bases. We offer 188 works, several compilations of datasets, classifiers, and ensemble algorithms, and document the experiments that stood out in their performance. A characteristic of this research is its originality. We found two surveys in the literature specifically focusing on the relationship between ensemble techniques and intrusion detection [1] [2]. We present for the last eight years covered by this survey a timeline-based view of the works studied to highlight evolutions and trends. The results obtained by our survey show a growing area, with excellent results in detecting attacks but with needs for improvement in pruning for choosing classifiers, which makes this work unprecedented for this context.

Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model

Article

Full-text available

Mar 2017

Efficiently detecting network intrusions requires the gathering of sensitive information. This means that one has to collect large amounts of network transactions, including highly detailed records of recent network transactions. Assessments based on meta-heuristic anomaly are important in the exploratory analysis of intrusion-related network transaction data. These assessments are needed to make and deliver predictions related to the intrusion possibility based on the available attribute details that are involved in the network transaction. We were able to utilize the NSL-KDD data set, the binary and multiclass problem with a 20% testing dataset. This paper develops a new hybrid model that can be used to estimate the intrusion scope threshold degree based on the optimal features of the network transaction data that were made available for training. The experimental results revealed that the hybrid approach had a significant effect on the minimisation of the computational and time complexity involved when determining the feature association impact scale. The accuracy of the proposed model was measured as 99.81% and 98.56% for the binary class and multiclass NSL-KDD data sets, respectively. However, there are issues with obtaining high false and low false negative rates. A hybrid approach with two main parts is proposed to address these issues. First, data needs to be filtered using the Vote algorithm with Information Gain that combines the probability distributions of these base learners in order to select the important features that positively affect the accuracy of the proposed model. Next, the hybrid algorithm consists of the following classifiers: J48, Meta Pagging, RandomTree, REPTree, AdaBoostM1, DecisionStump and NaiveBayes. Based on the results obtained using the proposed model, we observe improved accuracy, high false negative rate, and low false positive rate.

Data Mining Techniques in Social Media: A Survey

Article

Full-text available

Jun 2016
NEUROCOMPUTING

Today, the use of social networks is growing ceaselessly and rapidly. More alarming is the fact that these networks have become a substantial pool for unstructured data that belong to a host of domains, including business, governments and health. The increasing reliance on social networks calls for data mining techniques that are likely to facilitate reforming the unstructured data and placing them within a systematic pattern. The goal of the present survey is to analyze the data mining techniques that were utilized by social media networks between 2003 and 2015. Espousing criterion-based research strategies, 66 articles were identified to constitute the source of the present paper. After a careful review of these articles, we found that 19 data mining techniques have been used with social media data to address 9 different research objectives in 6 different industrial and services domains. However, the data mining applications in social media are still raw and require more effort by academia and industry to adequately perform the job. We suggest that more research be conducted by both academia and industry, since the studies done so far are not sufficiently exhaustive of data mining techniques.

Intrusion Detection Model Using fusion of Chi-square feature selection and multi class SVM

Article

Full-text available

Mar 2016

Intrusion detection is a promising area of research in the domain of security with the rapid development of internet in everyday life. Many intrusion detection systems (IDS) employ a sole classifier algorithm for classifying network traffic as normal or abnormal. Due to the large amount of data, these sole classifier models fail to achieve a high attack detection rate with reduced false alarm rate. However, by applying dimensionality reduction, data can be efficiently reduced to an optimal set of attributes without loss of information and then classified accurately using a multi class modeling technique for identifying the different network attacks. In this paper, we propose an intrusion detection model using chi-square feature selection and multi class support vector machine (SVM). A parameter tuning technique is adopted for optimization of the Radial Basis Function kernel parameter, namely gamma, represented by ‘γ’, and the over fitting constant ‘C’. These are the two important parameters required for the SVM model. The main idea behind this model is to construct a multi class SVM, which has not been adopted for IDS so far, to decrease the training and testing time and increase the individual classification accuracy of the network attacks. The investigational results on the NSL-KDD dataset, which is an enhanced version of the KDDCup 1999 dataset, show that our proposed approach results in better detection rate and reduced false alarm rate. An experiment on the computational time required for training and testing is also carried out for usage in time critical applications.
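The chi-square feature selection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes categorical features and computes the standard Pearson chi-square statistic between each feature and the class label; the function names `chi_square_score` and `rank_features` and the toy records are hypothetical, and this is not the cited paper's implementation (the SVM stage and the γ/C tuning are not shown).

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between one categorical feature and the label."""
    n = len(labels)
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    observed = Counter(zip(feature_values, labels))
    score = 0.0
    for value, value_count in feature_counts.items():
        for label, label_count in label_counts.items():
            # Expected cell count if the feature and the label were independent.
            expected = value_count * label_count / n
            score += (observed[(value, label)] - expected) ** 2 / expected
    return score


def rank_features(rows, labels):
    """Return (feature indices sorted by descending score, list of scores)."""
    n_features = len(rows[0])
    scores = [
        chi_square_score([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy records: (protocol, flag). Only the protocol tracks the label.
rows = [("tcp", "a"), ("tcp", "b"), ("udp", "a"), ("udp", "b")]
labels = ["normal", "normal", "attack", "attack"]
order, scores = rank_features(rows, labels)
print(order, scores)
```

A higher score means the feature's distribution differs more between classes, so a selector under these assumptions would keep the top-ranked indices and pass only those columns to the classifier.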

Design of intelligent KNN‐based alarm filter using knowledge‐based alert verification in intrusion detection

Article

Full-text available

Jul 2015

Network intrusion detection systems (NIDSs) have been widely deployed in various network environments to defend against different kinds of network attacks. However, a large number of alarms especially unwanted alarms such as false alarms and non-critical alarms could be generated during the detection, which can greatly decrease the efficiency of the detection and increase the burden of analysis. To address this issue, we advocate that constructing an alarm filter in terms of expert knowledge is a promising solution. In this paper, we develop a method of knowledge-based alert verification and design an intelligent alarm filter based on a multi-class k-nearest-neighbor classifier to filter out unwanted alarms. In particular, the alarm filter employs a rating mechanism by means of expert knowledge to classify incoming alarms to proper clusters for labeling. We further analyze the effect of different classifier settings on classification accuracy with two alarm datasets. In the evaluation, we investigate the performance of the alarm filter with a real dataset and in a network environment, respectively. Experimental results indicate that our alarm filter can effectively filter out a number of NIDS alarms and can achieve a better outcome under the advanced mode.

Intrusion detection and Big Heterogeneous Data: a Survey

Article

Full-text available

Feb 2015

Intrusion Detection has been heavily studied in both industry and academia, but cybersecurity analysts still desire much more alert accuracy and overall threat analysis in order to secure their systems within cyberspace. Improvements to Intrusion Detection could be achieved by embracing a more comprehensive approach in monitoring security events from many different heterogeneous sources. Correlating security events from heterogeneous sources can grant a more holistic view and greater situational awareness of cyber threats. One problem with this approach is that currently, even a single event source (e.g., network traffic) can experience Big Data challenges when considered alone. Attempts to use more heterogeneous data sources pose an even greater Big Data challenge. Big Data technologies for Intrusion Detection can help solve these Big Heterogeneous Data challenges. In this paper, we review the scope of works considering the problem of heterogeneous data and in particular Big Heterogeneous Data. We discuss the specific issues of Data Fusion, Heterogeneous Intrusion Detection Architectures, and Security Information and Event Management (SIEM) systems, as well as presenting areas where more research opportunities exist. Overall, both cyber threat analysis and cyber intelligence could be enhanced by correlating security events across many diverse heterogeneous sources.

A New Intrusion Detection System Based on KNN Classification Algorithm in Wireless Sensor Network

Article

Full-text available

Jun 2014

The Internet of Things has broad application in the military field, commerce, environmental monitoring, and many other fields. However, the open nature of the information media and the poor deployment environment have brought great risks to the security of wireless sensor networks, seriously restricting their application. An Internet of Things composed of wireless sensor networks faces security threats mainly from DoS attacks, replay attacks, integrity attacks, false routing information attacks, and flooding attacks. In this paper, we propose a new intrusion detection system based on the K-nearest neighbor (KNN) classification algorithm in wireless sensor networks. This system can separate abnormal nodes from normal nodes by observing their abnormal behaviors, and we analyse the parameter selection and error rate of the intrusion detection system. The paper elaborates on the design and implementation of the detection system. This system has achieved efficient, rapid intrusion detection by improving the wireless ad hoc on-demand distance vector (AODV) routing protocol. Finally, the test results show that the system has high detection accuracy and speed, in accordance with the requirements of wireless sensor network intrusion detection.
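As a rough illustration of the KNN classification idea this abstract relies on, the sketch below labels a node as normal or abnormal by majority vote among its k nearest labeled neighbors. The two behavior features, the sample values, and the name `knn_predict` are assumptions made for the example; the cited system's actual features and its AODV modifications are not described in the abstract and are not reproduced here.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical node behavior: (route requests per second, packet drop ratio).
train_points = [(1.0, 0.02), (1.2, 0.01), (0.9, 0.03),
                (9.0, 0.60), (8.5, 0.55), (9.4, 0.70)]
train_labels = ["normal", "normal", "normal",
                "abnormal", "abnormal", "abnormal"]

print(knn_predict(train_points, train_labels, (8.8, 0.5)))
print(knn_predict(train_points, train_labels, (1.1, 0.02)))
```

The choice of k trades noise tolerance against sensitivity to small clusters of abnormal nodes, which is the kind of parameter selection the abstract says the authors analyse.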

Evaluation of three rapid oral fluid test devices on the screening of multiple drugs of abuse including ketamine

Article

Mar 2018
FORENSIC SCI INT

Rapid oral fluid testing (ROFT) devices have been extensively evaluated for their ability to detect common drugs of abuse; however, the performance of such devices on simultaneous screening for ketamine has been scarcely investigated. The present study evaluated three ROFT devices (DrugWipe® 6S, Ora-Check® and SalivaScreen®) on the detection of ketamine, opiates, methamphetamine, cannabis, cocaine and MDMA. A liquid chromatography tandem mass spectrometry (LCMS) assay was first established and validated for confirmation analysis of the six types of drugs and/or their metabolites. In the field test, the three ROFT devices were tested on subjects recruited from substance abuse clinics/rehabilitation centre. Oral fluid was also collected using Quantisal® for confirmation analysis. A total of 549 samples were collected in the study. LCMS analysis on 491 samples revealed the following drugs: codeine (55%), morphine (49%), heroin (40%), methamphetamine (35%), THC (8%), ketamine (4%) and cocaine (2%). No MDMA-positive cases were observed. Results showed that the overall specificity and accuracy were satisfactory and met the DRUID standard of >80% for all 3 devices. Ora-Check® had poor sensitivities (ketamine 36%, methamphetamine 63%, opiates 53%, cocaine 60%, THC 0%). DrugWipe® 6S showed good sensitivities in the methamphetamine (83%) and opiates (93%) tests but performed relatively poorly for ketamine (41%), cocaine (43%) and THC (22%). SalivaScreen® also demonstrated good sensitivities in the methamphetamine (83%) and opiates (100%) tests, and had the highest sensitivity for ketamine (76%) and cocaine (71%); however, it failed to detect any of the 28 THC-positive cases. The test completion rates (proportion of tests completed with quality control passed) were: 52% (Ora-Check®), 78% (SalivaScreen®) and 99% (DrugWipe® 6S).

Modification of Supervised OPF-based Intrusion Detection Systems Using Unsupervised Learning and Social Network Concept

Article

Aug 2016
PATTERN RECOGN

Optimum-path forest (OPF) is a graph-based machine learning method that can overcome some limitations of the traditional machine learning algorithms that have been used in intrusion detection systems. This paper presents a novel approach for intrusion detection using a modified OPF (MOPF) algorithm for improving the performance of traditional OPF in terms of detection rate (DR), false alarm rate (FAR), and time of execution. To address the problem of scalability in large datasets and also for achieving high attack recognition rates, the proposed framework employs the k-means clustering algorithm, as a partitioning module, for generating different homogeneous training subsets from original heterogeneous training samples. In the proposed MOPF algorithm, the distance between unlabeled samples and the root (prototype) of every sample in OPF is also considered in classifying unlabeled samples with the aim of improving the accuracy rate of traditional OPF algorithm. Moreover, the centrality and the prestige concepts in the social network analysis are employed in a pruning module for determining the most informative samples in training subsets to speed up the traditional OPF algorithm. The experimental results on NSL-KDD dataset show that the proposed method performs better than traditional OPF in terms of accuracy rate, DR, FAR, and cost per example (CPE) evaluation metrics.
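The partitioning module described in this abstract, where k-means splits heterogeneous training samples into more homogeneous subsets, can be sketched as follows. This is a plain Lloyd's-algorithm sketch under simplifying assumptions (Euclidean distance, the first k points as initial centroids, a fixed iteration count); the name `kmeans_partition` and the toy points are hypothetical, and the OPF classifier and pruning module are not shown.

```python
import math


def kmeans_partition(points, k, iterations=20):
    """Split `points` into k subsets of indices using Lloyd's k-means."""
    # Assumption: deterministic initialisation from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    subsets = [[] for _ in range(k)]
    for index, cluster in enumerate(assignment):
        subsets[cluster].append(index)
    return subsets, centroids


# Hypothetical two-feature training samples forming two obvious groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
subsets, centroids = kmeans_partition(points, k=2)
print(subsets)
```

Each returned subset of indices could then be used to train a separate classifier, which is the general motivation for partitioning before classification.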

Building an Intrusion Detection System Using a Filter-Based Feature Selection Algorithm

Article

Oct 2016

Redundant and irrelevant features in data have caused a long-term problem in network traffic classification. These features not only slow down the process of classification but also prevent a classifier from making accurate decisions, especially when coping with big data. In this paper, we propose a mutual information based algorithm that analytically selects the optimal feature for classification. This mutual information based feature selection algorithm can handle linearly and nonlinearly dependent data features. Its effectiveness is evaluated in the cases of network intrusion detection. An Intrusion Detection System (IDS), named Least Square Support Vector Machine based IDS (LSSVM-IDS), is built using the features selected by our proposed feature selection algorithm. The performance of LSSVM-IDS is evaluated using three intrusion detection evaluation datasets, namely KDD Cup 99, NSL-KDD and Kyoto 2006+ dataset. The evaluation results show that our feature selection algorithm contributes more critical features for LSSVM-IDS to achieve better accuracy and lower computational cost compared with the state-of-the-art methods.
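To make the mutual information criterion concrete, here is a minimal sketch that estimates the mutual information (in bits) between a discrete feature and the class label from empirical frequencies. It is only the textbook estimator; the cited paper's selection algorithm and its handling of redundancy between features are not reproduced, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Empirical mutual information I(feature; label) in bits."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    mi = 0.0
    for (value, label), count in joint.items():
        p_joint = count / n
        p_value = feature_counts[value] / n
        p_label = label_counts[label] / n
        # Each observed cell contributes p(x, y) * log2(p(x, y) / (p(x) p(y))).
        mi += p_joint * math.log2(p_joint / (p_value * p_label))
    return mi


labels = ["normal", "normal", "attack", "attack"]
informative = ["tcp", "tcp", "udp", "udp"]   # determines the label
uninformative = ["a", "b", "a", "b"]         # independent of the label

print(mutual_information(informative, labels))
print(mutual_information(uninformative, labels))
```

Because mutual information depends only on the joint distribution, it captures nonlinear as well as linear dependence, which is the property the abstract highlights.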

A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems

Article

Nov 2015
EXPERT SYST APPL

This paper presents a new feature-selection approach based on the cuttlefish optimization algorithm which is used for intrusion detection systems (IDSs). Because IDSs deal with a large amount of data, one of the crucial tasks of IDSs is to keep the best quality of features that represent the whole data and remove the redundant and irrelevant features. The proposed model uses the cuttlefish algorithm (CFA) as a search strategy to ascertain the optimal subset of features and the decision tree (DT) classifier as a judgement on the selected features that are produced by the CFA. The KDD Cup 99 dataset is used to evaluate the proposed model. The results show that the feature subset obtained by using CFA gives a higher detection rate and accuracy rate with a lower false alarm rate, when compared with the obtained results using all features.
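The evaluation metrics named in this abstract (detection rate, accuracy rate, and false alarm rate) follow directly from confusion-matrix counts. The sketch below uses the common definitions, treating attacks as the positive class; the counts are invented for illustration and are not results from any cited paper.

```python
def ids_metrics(tp, fp, tn, fn):
    """Return detection rate, false alarm rate, and accuracy from confusion counts.

    tp: attacks flagged as attacks      fn: attacks missed
    fp: normal traffic flagged          tn: normal traffic passed
    """
    detection_rate = tp / (tp + fn)       # also called recall or sensitivity
    false_alarm_rate = fp / (fp + tn)     # share of normal traffic flagged
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return detection_rate, false_alarm_rate, accuracy


# Hypothetical counts for illustration only.
dr, far, acc = ids_metrics(tp=90, fp=5, tn=95, fn=10)
print(dr, far, acc)
```

Comparing these three numbers for a model trained on all features against one trained on a selected subset is the usual way to judge whether feature selection helped.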

Recommended publications

Reconstructing Classification to Enhance Machine-Learning Based Network Intrusion Detection by Embra...

Bayesian Optimization with Machine Learning Algorithms Towards Anomaly Detection

Multi-Stage Optimized Machine Learning Framework for Network Intrusion Detection