A ROBUST GRADIENT BOOSTING MODEL BASED ON SMOTE AND NEAR MISS METHODS FOR INTRUSION DETECTION IN IMBALANCED DATA SETS

AHMET OKAN ARIK
IŞIK UNIVERSITY
JANUARY, 2022
Işık University, School of Graduate Studies, the Degree of Master of Science in
Information Technologies
2022
This thesis was submitted to Işık University, School of Graduate Studies, for the Master of Science Degree in Information Technologies.
IŞIK UNIVERSITY
SCHOOL OF GRADUATE STUDIES
MASTER OF SCIENCE IN INFORMATION TECHNOLOGIES
APPROVED BY:
Asst. Prof. Gülsüm Çiğdem Çavdaroğlu Akkoç (Thesis Advisor), Işık University / MIS
Asst. Prof. Şahin Aydın, Işık University / MIS
Asst. Prof. Zeynep Turgut Akgün, İstanbul Medeniyet University / Computer Engineering
APPROVAL DATE: 18/01/2022
ABSTRACT
Novel technologies introduce many security vulnerabilities and zero-day attack risks. Intrusion Detection Systems (IDS) are developed to protect computer networks from threats and attacks, but existing methods still face many challenging problems. Class imbalance is one of the most difficult of these, and it reduces the detection rate of classifiers. The highest IDS detection rate reported in the literature is 96.54%. This thesis proposes a new model based on gradient boosting, called ROGONG-IDS (Robust Gradient Boosting IDS). ROGONG-IDS uses the Synthetic Minority Over-Sampling Technique (SMOTE) and Near Miss methods to handle class imbalance. Three gradient boosting-based classification algorithms (GBM, LightGBM, XGBoost) were compared. The performance of the proposed model on multiclass classification was verified on the UNSW-NB15 dataset. It reached the highest attack detection rate and F1 score in the literature, with a 97.30% detection rate and a 97.65% F1 score. ROGONG-IDS provides a robust, efficient solution for IDS built on datasets with an imbalanced class distribution, and it outperforms state-of-the-art and traditional intrusion detection methods.
Key words: Machine learning, Cyber security, Intrusion detection system, Imbalanced
data, Gradient boosting.
SALDIRI TESPİT SİSTEMLERİ İÇİN DENGESİZ VERİ
SETLERİNDE SMOTE VE NEAR MISS METOTLARINA
DAYALI GÜÇLÜ GRADYAN ARTIRMA MODELİ
ÖZET
New technologies give rise to many security vulnerabilities and zero-day attack risks. Intrusion detection systems have been developed to protect computer networks from threats and attacks. Existing methods still have many challenging problems to solve. Class imbalance is one of the most challenging of these, and it lowers the detection rate of classifiers in intrusion detection systems. The highest IDS attack detection rate in the literature is 96.54%. This thesis presents a gradient boosting-based model called ROGONG-IDS (Robust Gradient Boosting). The ROGONG-IDS model uses the Synthetic Minority Over-Sampling Technique (SMOTE) and Near Miss methods to handle class imbalance. Three gradient boosting-based classification algorithms (GBM, LightGBM, XGBoost) were compared. The performance of the proposed model on multiclass classification was tested on the UNSW-NB15 dataset. With a 97.30% detection rate and a 97.65% F1 score, ROGONG-IDS reached the highest attack detection rate and F1 score in the literature. ROGONG-IDS offers a robust, efficient solution for intrusion detection systems to be built on datasets with an imbalanced class distribution. The proposed model was observed to outperform intrusion detection systems built with state-of-the-art and traditional methods.
Key words: Machine learning, Cyber security, Intrusion detection system, Imbalanced data, Gradient boosting.
ACKNOWLEDGEMENTS
I would first like to thank my thesis advisor Asst. Prof. Gülsüm Çiğdem Çavdaroğlu Akkoç. The door to Prof. Çavdaroğlu's office was always open whenever I ran into a trouble spot or had a question about my research or writing. She consistently allowed this thesis to be my own work but guided me in the proper direction whenever she thought I needed it.
I would also like to thank Asst. Prof. Ali Cihan Keleş for his friendship and support, and my colleague Mehmet Ali Özer for his support.
Finally, I must express my very profound gratitude to my parents and my girlfriend
Beyza for providing me with unfailing support and continuous encouragement
throughout my years of study and through the process of researching and writing this
thesis. This accomplishment would not have been possible without them. Thank you.
Ahmet Okan ARIK
TABLE OF CONTENTS
APPROVAL PAGE
ABSTRACT
ÖZET
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1
1. INTRODUCTION
CHAPTER 2
2. LITERATURE REVIEW
2.1 Thesis Main Domain
2.2 Related Work
2.3 Contributions
CHAPTER 3
3. METHOD
3.1 Description of UNSW-NB15 Dataset
3.2 Data Preprocessing
3.3 Handling Imbalance Data
CHAPTER 4
4. EXPERIMENTAL ANALYSIS
4.1 Evaluation Metrics
4.2 Hyper-parameter Optimization
4.3 Multiclass Classification
CHAPTER 5
5. CONCLUSION AND FUTURE WORK
REFERENCES
APPENDIX
APPENDIX A) SOURCE CODE
RESUME
LIST OF TABLES
Table 3.1 Attack class distributions.
Table 3.2 Selected features of the UNSW-NB15 according to DAE.
Table 3.3 Studied undersampling methods.
Table 4.1 Test environment.
Table 4.2 XGBoost hyperparameters.
Table 4.3 Multiclass classification performance comparison between LightGBM, GBM and XGBoost.
Table 4.4 Comparison of multiclass classification results with advanced methods on the UNSW-NB15 dataset.
LIST OF FIGURES
Figure 2.1 Methods used in developing IDS models.
Figure 3.1 Architecture of ROGONG-IDS.
Figure 4.1 Comparison of gradient boosting methods on ROGONG-IDS.
Figure 4.2 Comparison of multiclass classification results with advanced methods using UNSW-NB15 dataset in the literature.
LIST OF ABBREVIATIONS
Accuracy: ACC
Anomaly detection based IDS: AIDS
Artificial Neural Network: ANN
Central Points: CP
Classification And Regression Tree: CART
Convolutional Neural Network: CNN
Deep Belief Networks: DBN
Deep Convolutional Neural Network: DCNN
Deep Neural Network: DNN
Denoising Autoencoder: DAE
Detection Rate: DR
False Alarm Rate: FAR
Feedforward Deep Neural Network: FFDNN
Gated Recurrent Unit: GRU
Gaussian Mixture Model: GMM
Gradient Boosting Machine: GBM
Host Intrusion Detection System: HIDS
Improved One-vs-One: I-OVO
Internet of Things: IoT
Intrusion Detection System: IDS
Intrusion Prevention System: IPS
K-Nearest Neighbors: K-NN
Long Short-Term Memory: LSTM
Network Access Controller: NAC
Network Intrusion Detection System: NIDS
Open System Interconnection: OSI
Particle Swarm Optimization: PSO
Random Over-Sampling: ROS
Robust Gradient Boosting IDS: ROGONG-IDS
Self-Taught Learning: STL
Signature-based IDS: SIDS
Support Vector Machine: SVM
The Australian Center for CyberSecurity: ACCS
CHAPTER 1
1. INTRODUCTION
A cyber attack is an attempt to compromise confidentiality, integrity, or availability, collectively known as the CIA triad. Many tools have been developed to combat cyber attacks, such as firewalls, anti-virus software, Network Access Controllers (NAC), endpoint security, Intrusion Prevention Systems (IPS), and Intrusion Detection Systems (IDS). An IDS is cyber security software that monitors a host or network to identify cyber attacks. The continuous development of the Internet of Things (IoT), Industry 4.0, Cloud Computing, and Big Data technologies has strikingly increased the number of devices connected to networks (Yang, Zheng, Wu, Yang, and Wang, 2020), and this growth continues to accelerate. Rapidly expanding connectivity and the spread of smart devices have led to an increase in cyber attacks, which has made IDS more important than ever before.
IDS are divided into two groups according to the detection method: (1) Signature-based IDS (SIDS) and (2) Anomaly detection based IDS (AIDS). While AIDS flags all behavior that is abnormal for an entity, SIDS flags behavior that is close to a predefined signature of a known intrusion (Axelsson, 2000). SIDS achieves high detection rates for known attack types because signatures are available for them. However, it is unable to identify new attack types because no signature patterns exist for them (Kabiri and Ghorbani, 2005). In addition, a large database of signatures must be kept, and the signatures of incoming traffic are compared with the signatures in the database, which is not resource-friendly (Uddin et al., 2013). In AIDS, any activity that falls outside the created normal profile is called an anomaly or abnormal behavior. The benefits of this method are that it can detect unknown attacks and new attack types and that, with a customizable normal activity profile, it is suitable for different networks and applications (Guo, Ping, Liu, and Luo, 2016). A weakness of AIDS is that it perceives any deviation from the baseline as an attack, so unpredictable but legitimate system behavior is labeled as an attack. This issue leads to a high false-positive rate.
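The difference between the two detection styles can be illustrated with a minimal sketch. It is not part of ROGONG-IDS; the signature list, the byte-count profile, and the three-standard-deviation threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical signature database; real SIDS rule sets are far larger.
KNOWN_SIGNATURES = (b"' OR 1=1 --", b"/etc/passwd")


def signature_alert(payload: bytes) -> bool:
    """SIDS-style check: alert only when a known signature occurs in the payload."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


def anomaly_alert(value: float, baseline: Sequence[float], k: float = 3.0) -> bool:
    """AIDS-style check: alert when value lies more than k sample standard
    deviations away from the mean of the normal-profile measurements."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) > k * sigma


# Made-up normal profile: bytes per flow seen during attack-free operation.
normal_bytes = [480.0, 510.0, 495.0, 505.0, 500.0, 490.0, 520.0]

print(signature_alert(b"GET /item?id=' OR 1=1 --"))  # known pattern is caught
print(signature_alert(b"GET /never-seen-exploit"))   # unknown attack is missed
print(anomaly_alert(5000.0, normal_bytes))           # large deviation is flagged
print(anomaly_alert(560.0, normal_bytes))            # unusual but possibly benign flow is flagged too
```

The last call shows the false-positive risk described above: a flow that is merely unusual is flagged even if it is legitimate.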
In this thesis, an AIDS model named ROGONG-IDS (Robust Gradient Boosting IDS) is proposed. The model consists of (1) a preprocessing module, (2) an imbalanced data handling module in which two-stage data resampling is performed, and (3) a classification decision module. The 47 independent variables in the UNSW-NB15 dataset were reduced to 12 using the Denoising Autoencoder (DAE) based feature selection method created by Zhang et al. After one-hot encoding, label encoding, and data standardization, three gradient boosting-based classifiers (LightGBM, GBM, XGBoost) were tested in the classification decision module. XGBoost, which gave the most successful attack classification results, was chosen as the classifier of the ROGONG-IDS model.
Chapter 2 presents the literature review, Chapter 3 the method, Chapter 4 the experiments and results, and Chapter 5 the conclusion and future work.
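The encoding and standardization steps named in the module description can be illustrated with a minimal sketch. The column names and values are made up, and which features the thesis one-hot encodes or label encodes is not stated here, so the assignment below is an assumption.

```python
from statistics import mean, pstdev


def label_encode(values):
    """Map each distinct category to an integer index (sorted order)."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot(values):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


def standardize(column):
    """Rescale a numeric column to zero mean and unit variance (z-score)."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical toy records: protocol, duration, and attack label.
proto = ["tcp", "udp", "tcp", "icmp"]
dur = [1.0, 2.0, 3.0, 4.0]
labels = ["Normal", "DoS", "Normal", "Exploits"]

proto_rows, proto_categories = one_hot(proto)
dur_z = standardize(dur)
y, classes = label_encode(labels)

print(proto_categories, proto_rows)
print(dur_z)
print(classes, y)
```

Standardization matters mostly for distance-based steps such as resampling with nearest neighbors; tree-based boosting itself is largely insensitive to feature scale.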
CHAPTER 2
2. LITERATURE REVIEW
2.1 Thesis Main Domain
IDS are of two types: (1) Network Intrusion Detection Systems (NIDS) and (2) Host Intrusion Detection Systems (HIDS). A NIDS is positioned inside the network to keep track of all traffic on it. It examines incoming packets within the scope of the developed IDS model and alerts the administrator when an attack or abnormal activity is detected. A HIDS monitors processes and applications on workstations or servers. It monitors essential system configuration files, log and content files, and registry files, and reports any unauthorized or abnormal behavior. In this thesis, a NIDS that monitors incoming network packets is studied.
AIDS are developed with machine learning and deep learning methods, as shown in Figure 2.1. Deep learning methods are now used frequently in the development of IDS models because they can perform feature engineering on their own and produce more successful results on high-volume data. Nevertheless, the traditional machine learning algorithms used in this domain should not be dismissed because of these advantages of deep learning: ROGONG-IDS, proposed in this thesis, achieved the most successful multiclass classification results in the literature with gradient boosting methods.
Figure 2.1 Methods used in developing IDS models.
2.2 Related Work
Ming, Zhou, and Chen (2021) designed a sequential model for IoT IDSs using deep learning methods. The parameters used for the model are obtained via Tcpdump from packets at the network layer of the Open System Interconnection (OSI) architecture. Text-Convolutional Neural Network (CNN) and Gated Recurrent Unit (GRU) algorithms are used for the sequence-based model. Because the model can extract new parameters from the data, the aim is to reach a higher F1 score, which is an appropriate measure of model success on imbalanced datasets. The KDD99 dataset was used for testing, and the model was compared with the Support Vector Machine (SVM), Naive Bayes, and C4.5 algorithms. The studied algorithms outperformed these traditional machine learning models in terms of F1 score on multiclass classification. The model achieved an F1 score of over 90% in the attack classes of the KDD99 dataset.
Thaseen, Banu, Lavanya, Ghalib, and Abhishek (2021) used an Artificial Neural Network (ANN) to devise their IDS model. Feature selection was performed by studying the correlation between the features, and the model was built with the features meeting a correlation threshold of 0.5. The UNSW-NB15 and KDD99 datasets were used for testing. The overall accuracy was 96.44% for the UNSW-NB15 dataset and 98.45% for the KDD99 dataset. They stated that traditional machine learning methods are insufficient to handle extensive network data and therefore adopted neural networks.
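A correlation-based filter of this kind can be sketched as follows. The exact criterion used in the cited study is not given here, so the sketch assumes one plausible reading: keep the features whose absolute Pearson correlation with the class label is at least 0.5. The feature names and values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_by_correlation(features, label, threshold=0.5):
    """Keep feature names whose |correlation with the label| reaches the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, label)) >= threshold]


# Hypothetical toy data: three features and a binary attack label.
features = {
    "duration": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "ttl": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],
}
label = [0, 0, 0, 1, 1, 1]

print(select_by_correlation(features, label))
```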
ROGONG-IDS reduces training and testing time by building the model with fewer, high-impact features in the feature selection step. In addition, ROGONG-IDS offers a two-stage solution for the imbalanced data problem, which is one of the most significant problems of IDS models and is not addressed in the mentioned study.
Mulyanto, Faisal, Prakosa, and Leu (2021) developed a model (FL-NIDS) that addresses the imbalanced data problem in IDS models with the Focal Loss method. The method was studied with Deep Neural Network (DNN) and CNN architectures and examined on three benchmark IDS datasets, NSL-KDD, UNSW-NB15, and Bot-IoT. The detection rate was observed to increase with the number of layers in both architectures. Since the accuracy score does not reflect the detection rate of the minority classes, the evaluation was made with the F1 score. On the UNSW-NB15 dataset, the CNN-SMOTE model reached a 36% F1 score, while FL-NIDS reached a 39% F1 score.
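Why accuracy can hide poor minority-class detection is easy to see on a small worked example. The sketch below uses made-up labels (95 normal flows, 5 attacks) and a classifier that catches only one attack; the function name is illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 score for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Imbalanced toy data: the classifier finds only 1 of the 5 attacks.
y_true = ["normal"] * 95 + ["attack"] * 5
y_pred = ["normal"] * 95 + ["attack"] * 1 + ["normal"] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="attack")

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.96
print(f"recall   = {recall:.2f}")    # recall   = 0.20
print(f"F1       = {f1:.2f}")        # F1       = 0.33
```

Accuracy stays at 96% although four of the five attacks are missed, while the F1 score of the attack class drops to about 0.33.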
This study has a low F1 score compared with ROGONG-IDS, so it cannot be said to have solved the imbalanced data problem successfully. In addition, the study's lack of data standardization and feature selection methods resulted in low classification scores.
Vigneswaran, Vinayakumar, Soman, and Poornachandran (2018) used a 3-layer DNN for their IDS model. Their binary classification model was evaluated on the KDDCup-99 dataset and compared with the AdaBoost, Decision Tree, K-Nearest Neighbors (K-NN), Linear Regression, Naive Bayes, Random Forest, and SVM methods. The one-layer DNN architecture was shown to achieve 92.9% accuracy and a 95.4% F1 score and to outperform traditional machine learning methods.
In this study, the low binary classification scores can be attributed in part to the fact that the imbalanced data problem was not addressed and feature selection methods were not applied. ROGONG-IDS offers a robust solution for imbalanced data.
Kaja, Shaout, and Ma (2019) used a two-stage method for attack classification in their IDS model. The model first performs detection with K-Means on packets coming from the network and then performs classification. This method aims to reduce the false-positive rate. The study, which used the KDD dataset, reached 99.97% accuracy, the most successful score in the literature on this dataset.
Yin, Zhu, Fei, and He (2017) studied Recurrent Neural Network (RNN) and Multilayer Perceptron (MLP) methods for building an IDS. Using the NSL-KDD dataset, they compared these methods with traditional machine learning methods. The study reached 81.29% accuracy with RNN and 78.10% accuracy with MLP.
This study uses an old dataset, NSL-KDD, and does not discuss imbalanced data, data standardization, or feature selection methods. In contrast, ROGONG-IDS achieved high accuracy and F1 scores using new methods.
Naseer et al. (2018) examined CNN, RNN, and Autoencoder architectures for IDS models. They found that the Deep Convolutional Neural Network (DCNN) and Long Short-Term Memory (LSTM) were the most successful deep learning methods for the NSL-KDD dataset. DCNN reached 85% accuracy, while LSTM reached 89% accuracy.
Since the accuracy value does not reflect the detection rate of the minority classes, the F1 score should also be evaluated. Imbalanced data handling, feature selection, and data standardization are other methods that could increase the success rate of this study.
Xu, Shen, Du, and Zhang (2018) used GRU and LSTM algorithms in IDS models. The study used the KDD99 and NSL-KDD datasets and reached 99.42% accuracy on KDD99 and 99.31% accuracy on NSL-KDD. Another significant result is that GRU outperforms the LSTM algorithm on these datasets.
In their study, Jiang, Wang W., Wang A., and Wu (2020) stated that the imbalanced class problem causes a high false detection rate. To avoid this, they propose an IDS model that combines hybrid sampling with a deep hierarchical network. To solve the imbalanced data problem, One-Side-Selection (OSS) is first used to reduce noise samples in the majority classes, and then the SMOTE method is used to increase the samples of the minority classes. The proposed model was tested on the NSL-KDD and UNSW-NB15 datasets, and accuracy values of 83.58% and 77.16% were achieved, respectively.
In this study, the imbalanced data problem was addressed with a two-stage method. Although the created model has a complex structure, its low accuracy can be attributed in part to the absence of data standardization and feature selection.
Yang et al. (2020) propose an IDS model called SAVAER-DNN. They state that the minority classes should be augmented to solve the imbalanced data problem and that the data augmentation method they use performs better than other well-known methods. The number of observations in each minority class was increased up to the size of the class with the most samples. They compared the created model with other oversampling methods and classification algorithms. The proposed model reached 93.01% accuracy, while Random Over-Sampling (ROS)-DNN reached 81.45%, Synthetic Minority Over-sampling Technique (SMOTE)-DNN 81.94%, and Adaptive Synthetic (ADASYN)-DNN 81.76% accuracy.
In this study, it could be said that increasing the number of observations of the minority classes increases the training time. A long training time is unsuitable for implementing the IDS model in real-time network environments. ROGONG-IDS, on the other hand, is suitable for real-time network environments due to its short execution time.
Belouch, Hadaj, and Idhammad (2018) used Apache Spark, a big data processing engine, for preprocessing in the IDS model they created. They reduced the number of features from 49 to 42 by applying a feature selection method. The SVM, Naive Bayes, Decision Tree, and Random Forest algorithms were tested. Random Forest achieved the best binary classification result with 97.49% accuracy.
For real-time network implementation, execution time needs to be reduced. This requires a more extensive feature selection process. In the feature selection step of ROGONG-IDS, the number of features is reduced to 12, which decreases execution time significantly.
Chkirbene, Eltanbouly, Bashendy, AlNaimi, and Erbad (2020) implemented comprehensive feature selection in their IDS model. The number of features was reduced to 13 with the Random Forest algorithm. Classification And Regression Tree (CART) is used as the classification method. The UNSW-NB15 and KDD99 datasets were used to test the success of the model. In multiclass classification, they reached 95.73% accuracy on the UNSW-NB15 dataset and 97.03% accuracy on the KDD99 dataset.
Using over-sampling and random sampling methods to solve the imbalanced class problem could yield higher accuracy.
Vinayakumar et al. (2019) propose an IDS model with a DNN architecture of one to five layers. They tested the model on the UNSW-NB15, KDDCup99, NSL-KDD, WSN-DS, and CICIDS 2017 datasets. They achieved the most successful results for the UNSW-NB15 dataset with a one-layer DNN architecture. On this dataset, a 78% accuracy score and an 82% F1 score were achieved in binary classification. In multiclass classification, they reached 64% accuracy.
In the mentioned study, to increase the low accuracy in multiclass classification and reduce the execution time, it is necessary to solve the imbalanced data problem and to use feature selection and data standardization methods.
Zhang, Huang, Wu, and Li (2020) propose an IDS model called SGM-CNN. This model offers a two-stage solution to the imbalanced data problem. In the first stage, the number of samples in each minority class is raised with the SMOTE method to the I_resample value, which is also used in our study. In the second stage, a Gaussian Mixture Model (GMM) based method brings the number of samples in each majority class down to the same I_resample value. This study, which proposes a CNN architecture, reached 96.54% accuracy in multiclass classification and 98.82% accuracy in binary classification.
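The two-stage idea can be illustrated with a minimal sketch. SMOTE creates a synthetic sample by interpolating between a minority sample and one of its nearest minority neighbors. For brevity, the majority classes are reduced here by plain random undersampling, which is a stand-in and not the GMM-based step of SGM-CNN. The parameter `i_resample`, the helper names, and the toy data are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points. Each one lies on the line segment between
    a randomly chosen minority sample and one of its k nearest minority
    neighbors. Needs at least two minority samples."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


def two_stage_resample(samples_by_class, i_resample, k=2, seed=0):
    """Bring every class to exactly i_resample samples."""
    rng = random.Random(seed)
    balanced = {}
    for label, points in samples_by_class.items():
        if len(points) < i_resample:    # stage 1: oversample minority classes
            extra = smote(points, i_resample - len(points), k, rng)
            balanced[label] = list(points) + extra
        elif len(points) > i_resample:  # stage 2 stand-in: random undersampling
            balanced[label] = rng.sample(points, i_resample)
        else:
            balanced[label] = list(points)
    return balanced


# Hypothetical toy data: 12 normal flows and 3 worm flows in two dimensions.
data = {
    "normal": [(float(i), float(i % 3)) for i in range(12)],
    "worms": [(100.0, 100.0), (101.0, 100.0), (100.0, 102.0)],
}
balanced = two_stage_resample(data, i_resample=6)
print({label: len(points) for label, points in balanced.items()})
```

Because each synthetic point is an interpolation, it always stays inside the region spanned by the existing minority samples.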
Lee, Amaresh, Green, and Engels (2018) examined deep learning approaches for IDSs. Vanilla DNN, Self-Taught Learning (STL), and LSTM-based RNN algorithms were compared on the KDD Cup 99 dataset. According to the results, the Autoencoder reached 98.9% accuracy and LSTM reached 79.2% accuracy.
Al-Yaseen, Othman, and Nazri (2017) propose a multi-level hybrid IDS model to determine whether incoming network packets are normal or attack packets. K-Means was used to reduce the dataset by 10 percent, and an SVM classifier is used on the reduced dataset. The model was reported to achieve 95.75% accuracy on the KDD Cup 1999 dataset and to outperform the other methods it was compared with.
Elmasry, Akbulut, and Zaim (2020) state that the most critical problems of IDS, a high false alarm rate (FAR) and a low detection rate (DR), stem from datasets containing irrelevant features. A Particle Swarm Optimization (PSO) based method is proposed to overcome this problem. The method aims to tune hyperparameters automatically and to define feature subsets. They evaluated the proposed method with three deep learning methods: DNN, LSTM, and Deep Belief Networks (DBN). The NSL-KDD and CICIDS2017 datasets were used for testing. According to the results, the proposed method increases the detection rate by 4%-6% and reduces the FAR by 1%-5%. The highest success in multiclass classification on the NSL-KDD dataset was achieved by DBN with an accuracy score of 86.53%. On the CICIDS2017 dataset, the most successful result was again achieved by DBN, with an accuracy score of 82%.
Zhiqianq et al. (2019) state that traditional machine learning algorithms are not sufficiently effective for IDSs. They therefore propose a deep learning model tested with the UNSW-NB15 dataset. The proposed model is a 10-layer Feedforward ANN model consisting of 100 neurons. The proposed method is reported to outperform algorithms such as Logistic Regression, Naive Bayes, and ANN. Furthermore, the accuracy score is 99.5% in binary classification, and the FAR is 0.47.
Moustafa and Slay (2017) propose a hybrid method for feature selection, which has an important place in developing advanced IDS models. First, the method reduces processing time by selecting the most frequently occurring observations with Central Points (CP). Next, the best-ranked features are obtained with Association Rule Mining (ARM). Finally, irrelevant and noisy features are removed. In the evaluation on binary classification with the UNSW-NB15 and NSL-KDD datasets, this feature selection method performed more successfully on the UNSW-NB15 dataset.
Shende and Thorat (2020) state that more effective deep learning methods should
be used instead of traditional machine learning methods. Using the NSL-KDD
dataset, they built IDS models with the MLP, LW-MLP, ANN, and CNN methods.
They achieved accuracy values of 97.79% with CNN and 97.14% with MLP.
Gupta, Jindal, and Bedi (2021) propose a model called LIO-IDS. This model
uses an LSTM classifier and the Improved One-vs-One (I-OVO) technique. LIO-IDS
is an anomaly-based IDS consisting of two layers. In the first layer, packets are
classified as attack or normal with LSTM, and in the second layer, an ensemble
method is used to classify attacks. I-OVO, used for multiclass classification in the
second layer, differs from the traditional OVO method by using only three classifiers
for each observation, thus reducing the test time. The over-sampling methods
SVM-SMOTE, Borderline-SMOTE, and Random Over-Sampling (ROS) were used to
improve detection in the second layer. The NSL-KDD, CIDDS-001, and CICIDS2017
datasets were used for evaluation. The results indicate that the proposed LIO-IDS
model performs significantly better than other IDS models. With its high DR and
short computational time, it is stated to be suitable for real-world deployment.
LIO-IDS achieved 87% accuracy on the NSL-KDD dataset, 96% accuracy on the
CIDDS-001 dataset, and 86% accuracy on the CICIDS2017 dataset.
Kasongo and Sun (2019) state that IDSs become less effective and more
computationally expensive as the feature space grows. To solve this problem for
wireless IDSs, they propose a Feedforward Deep Neural Network (FFDNN) combined
with a filter-based feature selection method. The model was compared with the SVM,
Decision Tree, K-NN, and Naive Bayes methods on the NSL-KDD dataset. The
proposed model outperformed these methods and reached 86.19% accuracy on
multiclass classification.
Kang M-J. and Kang J-W. (2016) propose a new IDS model for vehicular
networks using DNN. The proposed IDS model monitors the packets broadcast in the
Controller Area Network and tests them for attacks. Probability-based feature
vectors were trained using DBN. The experiments show that the model can provide a
real-time response to network traffic with an accuracy score of 98%.
Liu, Gao, and Hu (2021) propose an ensemble IDS model that addresses the
imbalanced data problem. This model uses ADASYN for oversampling and LightGBM
for classification. After normalizing the data, experiments were performed on the
NSL-KDD, UNSW-NB15, and CICIDS2017 datasets. The model, which is reported to be
more successful and to have a shorter training time than other IDS models, reached
85.89% accuracy on the UNSW-NB15 dataset.
Khan, Gumaei, Derhab, and Hussain (2019) suggest a two-stage IDS model.
In the first stage, the model classifies network packets as normal or abnormal
based on the probability score generated by the Stacked Auto-Encoder. In the second
stage, attacks are classified using the Softmax classifier. Thus, the proposed system
can classify unlabeled data. The model, evaluated with different algorithms, reached
89.13% accuracy and a FAR of 0.74 on the UNSW-NB15 dataset.
2.3 Contributions
This thesis proposes an IDS model called ROGONG-IDS, which achieves the highest
multiclass accuracy and 𝐹1 score reported in the literature on the UNSW-NB15
dataset. Furthermore, the literature review shows that gradient boosting methods
and resampling are rarely used in IDS models; this thesis may encourage their more
frequent use in IDS development. Thus, it may help broaden the study of
anomaly-based IDS (AIDS).
CHAPTER 3
3. METHOD
The ROGONG-IDS architecture consists of three modules, as shown in Figure 3.1:
(1) a data preprocessing module, (2) an imbalanced data handling module, and (3) a
classification decision module. The data preprocessing module aims to make the data
suitable for modeling. The operations performed in this module are: (a) one-hot
encoding and label encoding to process categorical data, (b) feature selection to
reduce model training time and increase accuracy by eliminating redundant features,
and (c) data standardization to bring features measured on different scales to a
common scale. The imbalanced data handling module involves resampling operations
to ensure class balance. Its two-stage resampling method uses the Near Miss method
for undersampling and SMOTE for oversampling. This method is the most critical
factor in increasing the model accuracy. SMOTE was used to increase the number of
samples in the minority classes, whereas Near Miss was used to reduce bias by
undersampling the majority classes. Finally, the XGBoost algorithm, which is based
on gradient boosting and whose hyperparameters were optimized using Bayesian
optimization, was used in the classification decision stage. The model was then
evaluated on the UNSW-NB15 dataset to assess it in a network environment.
Figure 3.1 Architecture of ROGONG-IDS.
3.1 Description of UNSW-NB15 Dataset
IDS studies suffer from a lack of datasets that contain structured network
information and cover current network traffic scenarios. Datasets such as
KDD98, KDDCUP99, and NSL-KDD, which are still used to evaluate IDS models,
do not reflect today's network traffic scenarios. The Australian Centre
for Cyber Security (ACCS) research group developed the UNSW-NB15 dataset
(Moustafa and Slay, 2015). The research group combined current normal network
traffic with synthetic attack data to generate this dataset. It is a comprehensive
dataset that contains 2.54 million rows of traffic data and nine attack types and can
reflect today's network traffic scenarios. Of the 49 features in the dataset,
two are class labels. There is a high level of class imbalance in the dataset:
regular traffic is 87.35%, and attack traffic is only 12.65%. The whole dataset was
used for modeling and was divided into training and test sets at a ratio of 7:3. The
attack class distributions in the dataset are detailed in Table 3.1 below.
Table 3.1 Attack class distributions.
| Class | Description | Training set size | Test set size | Total |
| --- | --- | --- | --- | --- |
| Analysis | Port-based attack for web applications. | 1,874 | 803 | 2,677 |
| Backdoor | Remote penetration attack to obtain unauthorized access to a system. | 1,630 | 699 | 2,329 |
| DoS | An attack that aims to disrupt the services of the system temporarily or indefinitely. | 11,447 | 4,906 | 16,353 |
| Exploits | A penetration attack that aims to exploit a bug or vulnerability through code. | 31,167 | 13,358 | 44,525 |
| Fuzzers | A type of attack that scans the target system for information using a software testing technique. | 16,972 | 7,274 | 24,246 |
| Generic | A "generic attack" toward a cryptographic primitive is one that can be run independently of how that cryptographic primitive is implemented. | 150,837 | 64,644 | 215,481 |
| Normal | Real transaction data. | 1,553,134 | 665,630 | 2,218,764 |
| Reconnaissance | Reconnaissance attacks are information-gathering attacks. | 9,791 | 4,196 | 13,987 |
| Shellcode | Shellcode is a collection of instructions that performs a command in software to gain control of or exploit a compromised machine. | 1,058 | 453 | 1,511 |
| Worms | The worm is software carrying malicious code that attacks host machines and spreads via a network. | 122 | 52 | 174 |
| Total | 10 classes. | 1,778,032 | 762,015 | 2,540,047 |
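The degree of imbalance can be checked directly from the totals in Table 3.1. The
short Python sketch below is only illustrative: the dictionary copies the "Total"
column of the table, and the variable names are chosen here for the example.

```python
# Total samples per class, copied from the "Total" column of Table 3.1.
totals = {
    "Analysis": 2_677, "Backdoor": 2_329, "DoS": 16_353, "Exploits": 44_525,
    "Fuzzers": 24_246, "Generic": 215_481, "Normal": 2_218_764,
    "Reconnaissance": 13_987, "Shellcode": 1_511, "Worms": 174,
}

n_rows = sum(totals.values())
normal_share = totals["Normal"] / n_rows
attack_share = 1 - normal_share
# Ratio between the largest and the smallest class, a simple imbalance measure.
imbalance_ratio = max(totals.values()) / min(totals.values())

# Print the classes from rarest to most frequent with their share of all rows.
for name, count in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{name:<15}{count:>10,}{count / n_rows:>9.3%}")
print(f"normal {normal_share:.2%}, attack {attack_share:.2%}, "
      f"largest/smallest {imbalance_ratio:,.0f}:1")
```

The computed shares agree with the 87.35% regular and 12.65% attack traffic stated
for the dataset.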
3.2 Data Preprocessing
Feature selection, one-hot encoding, label encoding, and data standardization
processes were carried out in this phase.
The DAE model (Zhang et al., 2018) was used in the feature selection stage. DAE
aims to reduce feature dimensionality by identifying a limited number of critical
features. The feature set was limited to the 12 features determined by DAE.
Table 3.2 shows these 12 selected features. Feature selection was performed at the
beginning of the process to accelerate the preprocessing phase.
Table 3.2 Selected features of the UNSW-NB15 according to DAE.
Stcpb, Service__-, Dload, Dmeansz, Service_dns, Sload, Trans_depth, Sttl,
Service_ftp-data, Ct_ftp
The UNSW-NB15 dataset has three nominal features: "proto", "state", and "service",
which have 135, 16, and 14 different values, respectively. One-hot encoding is used
so that machine learning algorithms can process these features without imposing an
artificial order on their values. With this process, the number of features in the
dataset increased from 47 to 208. Label encoding, another type of encoding, was
applied to the target feature, the attack class. Data standardization, the last step
of the data preprocessing stage, brings the data to a common scale. A common scale
helps to increase the accuracy of the model and to perform analytical studies on
multidimensional datasets. All features were standardized to a mean of 0 and a
standard deviation of 1, as in a standard Gaussian distribution, using the formula
in Eq. (1):
$$z = \frac{x - \mu}{\sigma} \tag{1}$$
where $x$ is the original feature value, $\mu$ is the mean of the feature, and
$\sigma$ is its standard deviation.
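The encoding and standardization steps described above can be sketched in plain
Python. This is a minimal illustration rather than the thesis implementation: the
helper names (`one_hot`, `label_encode`, `standardize`) and the toy rows are
hypothetical, and a real pipeline would typically rely on a data-processing library.

```python
from statistics import fmean, pstdev

def one_hot(values):
    """Map each nominal value to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

def label_encode(labels):
    """Map each class name to an integer id."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[label] for label in labels]

def standardize(column):
    """z = (x - mean) / std, giving a column with mean 0 and standard deviation 1."""
    mu, sigma = fmean(column), pstdev(column)
    if sigma == 0:  # constant column: nothing to scale
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Toy rows (hypothetical values, not taken from UNSW-NB15).
service = ["dns", "-", "ftp-data", "dns"]
sttl = [31.0, 254.0, 62.0, 31.0]
attack_cat = ["Normal", "DoS", "Exploits", "Normal"]

service_cols, service_onehot = one_hot(service)   # nominal input feature
classes, y = label_encode(attack_cat)             # target feature
sttl_z = standardize(sttl)                        # numeric input feature
```

One-hot encoding is applied to input features because it does not impose an order
on the categories, whereas label encoding is sufficient for the target, where the
integer only identifies the class.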
3.3 Handling Imbalance Data
As seen in Table 3.1, there is a severe imbalance between the target variable classes.
Because two classes have fewer than 2,000 instances out of 2.54 million rows,
oversampling the minority classes or undersampling the majority classes alone is not
sufficient: undersampling alone may remove useful information, and oversampling
alone generates many new observations that increase the computational cost.
ROGONG-IDS therefore uses both methods together. As shown in Table 3.3, 10
different undersampling methods were tried with the XGBoost algorithm and SMOTE
oversampling to find the most successful undersampling method for the UNSW-NB15
dataset. The Near Miss (v1) method gave the most successful results for
undersampling and was therefore used at this stage together with SMOTE.
Table 3.3 Studied undersampling methods.
| Undersampling Method | Accuracy | 𝐹1 score | Detection Rate | Algorithm | Oversampling Method |
| --- | --- | --- | --- | --- | --- |
| AllKNN | 96.05% | 96.77% | 96.05% | XGBoost | SMOTE |
| Edited Nearest Neighbours | 96.26% | 96.89% | 96.26% | XGBoost | SMOTE |
| Repeated Edited Nearest Neighbours | 96.08% | 96.78% | 96.08% | XGBoost | SMOTE |
| Instance Hardness Threshold | 94.20% | 95% | 94.20% | XGBoost | SMOTE |
| Near Miss Undersampling (v1) | 96.49% | 97.10% | 96.49% | XGBoost | SMOTE |
| Near Miss Undersampling (v3) | 95.45% | 96.41% | 95.45% | XGBoost | SMOTE |
| Neighbourhood Cleaning Rule | 96.03% | 96.64% | 96.03% | XGBoost | SMOTE |
| Random Undersampling | 96.22% | 96.97% | 96.22% | XGBoost | SMOTE |
| Tomek Link | 96.24% | 96.89% | 96.24% | XGBoost | SMOTE |
| One Sided Selection | 95.44% | 96.41% | 95.44% | XGBoost | SMOTE |
ROGONG-IDS uses a technique combining the SMOTE and Near Miss methods
that resamples all classes to an equal sample count called 𝐼𝑟𝑒𝑠𝑎𝑚𝑝𝑙𝑒
(Abdulhammed, Musafer, Alessa, Faezipour and Abuzneid, 2019) to handle the
imbalanced class distribution. Explanation of 𝐼𝑟𝑒𝑠𝑎𝑚𝑝𝑙𝑒 is provided in Eq. (2):
ROGONG-IDS uses SMOTE (Chawla, Bowyer, Hall, and Kegelmeyer, 2002)
to oversample classes with fewer than 𝐼𝑟𝑒𝑠𝑎𝑚𝑝𝑙𝑒 samples. SMOTE, one of the most
commonly used oversampling methods, increases the number of minority class
instances by synthesizing new ones, which is the reason for its success. Instead of
copying samples from the dataset, it generates samples that are not in the dataset.
Thus, the overfitting problem caused by random oversampling is avoided. Working
in the feature space, SMOTE draws a line between existing neighbouring minority
class instances and places a synthetic sample, generated by interpolation, on this
line.
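The interpolation step can be illustrated with a small, self-contained Python
sketch. It is a simplified version of the SMOTE idea written for this explanation,
not the implementation used in the thesis; the function name `smote`, the
neighbour count `k`, and the toy points are assumptions.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class in a two-dimensional feature space (hypothetical values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples,
the new points stay inside the region already occupied by the minority class
instead of duplicating existing rows.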
ROGONG-IDS uses the Near Miss undersampling method (Zhang and Mani,
2003) for classes with more samples than the 𝐼𝑟𝑒𝑠𝑎𝑚𝑝𝑙𝑒 value. Near Miss selects
majority class samples based on their distance to minority class samples.
It uses the Euclidean distance or a similar distance measure. Near Miss has three
versions, and ROGONG-IDS uses version 1. This version balances classes by keeping
the majority class samples that have the smallest average distance to their three
closest minority class samples. It significantly increases the detection rate of
minority classes.
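The selection rule of Near Miss version 1 can be sketched as follows. This is a
minimal illustration under the description above (Euclidean distance, three
closest minority samples); the function name `near_miss_v1` and the toy points are
hypothetical, and the sketch handles a single majority class only.

```python
import math

def near_miss_v1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority samples with the smallest mean distance
    to their k closest minority samples."""
    def mean_dist_to_closest(sample):
        dists = sorted(math.dist(sample, m) for m in minority)[:k]
        return sum(dists) / len(dists)
    return sorted(majority, key=mean_dist_to_closest)[:n_keep]

# Toy data (hypothetical values): three minority points near the origin.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0), (2.0, 2.0)]
kept = near_miss_v1(majority, minority, n_keep=2)
```

The majority samples that survive are those closest to the minority class, so the
classifier still sees the region where the two classes meet while the bulk of the
distant majority samples is discarded.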
The method ROGONG-IDS uses to handle imbalanced data differs from the two-stage
SGM method (Zhang et al., 2020) in its undersampling method.
Algorithm 1 provides the pseudocode of the ROGONG-IDS imbalanced data handling
method.
Algorithm 1. Method of handling imbalanced data on ROGONG-IDS.
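The class-level decision in this method can be sketched in Python. The sketch
assumes that 𝐼𝑟𝑒𝑠𝑎𝑚𝑝𝑙𝑒 is the training-set size divided by the number of classes;
this definition is an assumption made for the example and should be checked against
Eq. (2). The function name `resampling_plan` is hypothetical, and the counts are the
training-set sizes from Table 3.1.

```python
def resampling_plan(class_counts):
    """Return the shared target size and the action chosen for every class."""
    # ASSUMPTION: I_resample = total training samples // number of classes.
    i_resample = sum(class_counts.values()) // len(class_counts)
    plan = {}
    for name, count in class_counts.items():
        if count < i_resample:
            plan[name] = ("SMOTE", i_resample - count)        # samples to synthesize
        elif count > i_resample:
            plan[name] = ("NearMiss-1", count - i_resample)   # samples to drop
        else:
            plan[name] = ("keep", 0)
    return i_resample, plan

# Training-set sizes from Table 3.1.
train_counts = {
    "Analysis": 1_874, "Backdoor": 1_630, "DoS": 11_447, "Exploits": 31_167,
    "Fuzzers": 16_972, "Generic": 150_837, "Normal": 1_553_134,
    "Reconnaissance": 9_791, "Shellcode": 1_058, "Worms": 122,
}
i_resample, plan = resampling_plan(train_counts)
for name, (action, n) in plan.items():
    print(f"{name:<15}{action:<12}{n:>10,}")
```

Under this assumption only the Normal class exceeds the target and is undersampled,
while the nine attack classes are oversampled. The per-class targets could then be
passed to resampling implementations, for example those of a library such as
imbalanced-learn.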
3.4 Extreme Gradient Boosting: XGBoost
The XGBoost algorithm (Chen and Guestrin, 2016) is an enhanced version of
the gradient boosting method for decision trees. Chen and Guestrin (2016) aimed to
make tree boosting systems scalable, use computational resources effectively, and
improve model performance in classification and regression problems. Boosting is an
ensemble technique in which new models are added to fix the errors of the existing
model. New models are added iteratively until no further improvement is seen.
Gradient boosting fits each new model to the residuals of the previous models and
combines all models to make the final prediction. It uses the gradient descent
algorithm to minimize the loss when adding the new model. XGBoost is used heavily
to provide state-of-the-art results in classification and regression work; it was
used in 17 of the 29 challenge-winning solutions published on Kaggle during 2015
(Ogunleye and Wang, 2020).
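The residual-fitting idea behind gradient boosting can be shown with a compact
Python sketch. It implements plain gradient boosting with one-split regression
stumps and squared-error loss on a single feature; it is not XGBoost, which adds
regularization and system-level optimizations on top of this idea. All names and
the toy data are chosen for the example.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - l_mean) ** 2 for r in left)
               + sum((r - r_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, l_mean, r_mean)
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.5):
    """Additive model: every round fits a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    mse_history = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the residual y - f(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        mse_history.append(sum(r * r for r in residuals) / len(residuals))
        stumps.append(fit_stump(xs, residuals))
    return predict, mse_history

# Toy regression data (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1]
model, mse_history = gradient_boost(xs, ys)
```

Each round reduces the training error recorded in `mse_history`, which is the
"add a model to fix the errors of the existing model" behaviour described above.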
CHAPTER 4
4. EXPERIMENTAL ANALYSIS
The ROGONG-IDS implementation was tested on the UNSW-NB15 dataset
to measure detection accuracy. A MacBook Pro running the macOS Monterey
operating system was used for the implementation. The test environment is
described in Table 4.1.
Table 4.1 Test environment.
| Project | Environment / Version |
| --- | --- |
| Operating System | macOS Monterey |
| CPU | 1.4 GHz Quad-Core Intel Core i5 |
| GPU | Intel Iris Plus Graphics 645 |
| Memory | 16 GB |
4.1 Evaluation Metrics
Accuracy (ACC), DR, FAR, 𝐹1 score, Precision, and Recall, indicators that are
frequently used to evaluate classification on imbalanced classes, were used during
the experiments. For each attack class, the samples belonging to that class were
treated as positive and all others as negative. ACC represents the percentage of
correctly classified samples among all samples. DR is the rate of correctly predicted
positive samples; this ratio shows how well ROGONG-IDS detects the various attack
types. FAR is defined as the proportion of negative samples falsely classified as
positive. Recall, which is identical to DR, is the ratio of correctly predicted
positive class samples to the total number of positive class samples. Precision is the
proportion of samples predicted to be positive that are actually positive. The 𝐹1
score is the harmonic mean of Precision and Recall. In multiclass classification,
each metric is computed per class and then averaged, weighted by the number of
samples in each class, to capture the detection performance of the model on
imbalanced data. Eqs. (3-7) show these quality measures.
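$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
$$\mathrm{DR} = \mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$
$$\mathrm{FAR} = \frac{FP}{FP + TN} \tag{5}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{7}$$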
TP and FP are the numbers of samples correctly and incorrectly predicted to be
positive; TN and FN are the numbers of samples correctly and incorrectly predicted
to be negative.
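The per-class (one-vs-rest) computation and the sample-weighted average described
above can be sketched in Python. The function names and the toy labels are
hypothetical and serve only to illustrate the definitions.

```python
def binary_metrics(tp, fp, tn, fn):
    """Metrics for one class treated as positive and all others as negative."""
    dr = tp / (tp + fn) if tp + fn else 0.0          # detection rate = recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * dr / (precision + dr) if precision + dr else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"ACC": acc, "DR": dr, "FAR": far, "Precision": precision, "F1": f1}

def evaluate(y_true, y_pred):
    """Per-class metrics plus the number of true samples (support) per class."""
    per_class, support = {}, {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        tn = len(y_true) - tp - fp - fn
        per_class[c] = binary_metrics(tp, fp, tn, fn)
        support[c] = tp + fn
    return per_class, support

def weighted_average(per_class, support, key):
    """Average of a per-class metric, weighted by the class sample counts."""
    total = sum(support.values())
    return sum(per_class[c][key] * support[c] for c in per_class) / total

# Toy labels (hypothetical): 6 Normal, 3 DoS, 1 Worms sample.
y_true = ["Normal"] * 6 + ["DoS"] * 3 + ["Worms"]
y_pred = ["Normal"] * 5 + ["DoS"] + ["DoS", "DoS", "Normal"] + ["Worms"]

per_class, support = evaluate(y_true, y_pred)
weighted_dr = weighted_average(per_class, support, "DR")
overall_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this example the sample-weighted DR equals the overall share of correctly
classified samples. This holds in general for single-label multiclass problems,
which is consistent with the identical DR and accuracy values reported for each
model in the result tables.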
4.2 Hyper-parameter Optimization
Hyperparameters are high-impact parameters that control the learning process of
the model. Since they are tunable, they play a role in reaching the maximum
performance of the model in a reasonable time. However, as the size of the processed
data increases, the cost of the hyperparameter optimization process increases.
Therefore, the Grid Search and Random Search methods are cost-inefficient when
used with big data. ROGONG-IDS uses Distributed Asynchronous
Hyper-parameter Optimization (Hyperopt) (Bergstra, Yamins, and Cox, 2013) for
hyperparameter tuning. Hyperopt was developed to automate hyperparameter
optimization and is based on Bayesian optimization, which it uses to
define and narrow the search space and maximize the probability function. After
using Hyperopt, ROGONG-IDS model accuracy increased from 96.49% to 97.30% within
a reasonable amount of time. Table 4.2 shows the XGBoost hyperparameters selected
by Hyperopt.
Table 4.2 XGBoost hyperparameters.
| Parameters | Value |
| --- | --- |
| Learning Rate | 0.5 |
| Number of Estimators | 5000 |
| Max Depth | 36 |
| Colsample Bytree | 0.61 |
| Min Child Weight | 4 |
| Subsample | 0.9 |
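As an illustration, the values in Table 4.2 could be written as a Python
configuration for the scikit-learn style interface of the xgboost package. The
mapping from the table labels to keyword names is an assumption made here, and
settings not listed in the table (such as the objective) are left at their defaults.

```python
# Values from Table 4.2; the keyword names are assumed to correspond to the
# parameters of xgboost's scikit-learn wrapper (XGBClassifier).
xgb_params = {
    "learning_rate": 0.5,
    "n_estimators": 5000,
    "max_depth": 36,
    "colsample_bytree": 0.61,
    "min_child_weight": 4,
    "subsample": 0.9,
}

def build_classifier(params):
    """Create the classifier; the import is local so that the configuration
    itself can be inspected without the third-party xgboost package."""
    from xgboost import XGBClassifier
    return XGBClassifier(**params)
```

A call such as `build_classifier(xgb_params).fit(X_train, y_train)` would then
train the model on the resampled training data, where `X_train` and `y_train` are
placeholder names.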
4.3 Multiclass Classification
ROGONG-IDS's performance can be examined in Table 4.3, which shows the
DR results for each class. ROGONG-IDS essentially uses the XGBoost algorithm.
However, other gradient boosting-based algorithms were also tested with the same
imbalanced data handling method. Table 4.3 presents the performance results
of the LightGBM and Gradient Boosting Machine (GBM) algorithms. ROGONG-IDS
with XGBoost achieves the overall best performance, with a DR, accuracy, and
𝐹1 score of 97.30%, 97.30%, and 97.65%, respectively.
Table 4.3 Multiclass classification performance comparison between
LightGBM, GBM and XGBoost.
| Class | ROGONG-LightGBM | ROGONG-GBM | ROGONG-XGBoost |
| --- | --- | --- | --- |
| Analysis | 0.84 | 0.67 | 0.31 |
| Backdoor | 0.23 | 0.11 | 0.26 |
| DoS | 0.06 | 0.13 | 0.47 |
| Exploits | 0.46 | 0.48 | 0.54 |
| Fuzzers | 0.66 | 0.73 | 0.70 |
| Generic | 0.97 | 0.97 | 0.98 |
| Normal | 0.99 | 0.99 | 0.99 |
| Reconnaissance | 0.81 | 0.81 | 0.77 |
| Shellcode | 0.88 | 0.55 | 0.53 |
| Worms | 0.83 | 0.00 | 0.83 |
| DR (%) | 96.55 | 96.26 | 97.30 |
| Accuracy (%) | 96.55 | 96.26 | 97.30 |
| Precision (%) | 98.30 | 97.91 | 98.16 |
| 𝐹1 Score (%) | 97.18 | 96.91 | 97.65 |
| Train-Time (s) | 15.08 | 4336.71 | 205.27 |
| Test-Time (s) | 2.2 | 0.73 | 0.81 |
The test results of the gradient boosting methods tried for the ROGONG-IDS model
are summarized in Figure 4.1.
Figure 4.1 Comparison of gradient boosting methods on ROGONG-IDS.
Table 4.4 provides a performance comparison of advanced IDS models and
ROGONG-IDS. Although DR improved for many classes, the "Analysis", "Backdoor",
and "DoS" attack types remained below 50% DR. While SGM had a test time of
8.26 seconds, ROGONG-IDS made a significant improvement in this regard, reducing
the test time to 0.81 seconds.
Table 4.4 Comparison of multiclass classification results with advanced methods on the
UNSW-NB15 dataset.
| Class | SGM-CNN (Zhang et al., 2020) | Two-stage DL (Khan et al., 2019) | Hybrid Machine Learning (Chkirbene et al., 2020) | ICVAE-DNN (Yang, Zheng, Wu, and Yang, 2019) | ADASYN + LightGBM (Liu et al., 2021) | ROGONG-IDS |
| --- | --- | --- | --- | --- | --- | --- |
| Analysis | 0.27 | 0.01 | 0.00 | 0.15 | - | 0.31 |
| Backdoor | 0.51 | 0.00 | 0.6 | 0.21 | - | 0.26 |
| DoS | 0.39 | 0.00 | 0.8 | 0.8 | - | 0.47 |
| Exploits | 0.45 | 0.57 | 0.86 | 0.71 | - | 0.54 |
| Fuzzers | 0.67 | 0.40 | 0.53 | 0.35 | - | 0.70 |
| Generic | 0.97 | 0.61 | 0.97 | 0.96 | - | 0.98 |
| Normal | 0.98 | 0.82 | 0.80 | 0.81 | - | 0.99 |
| Reconnaissance | 0.82 | 0.24 | 0.79 | 0.80 | - | 0.77 |
| Shellcode | 0.88 | 0.00 | 0.51 | 0.92 | - | 0.53 |
| Worms | 0.83 | 0.00 | 0.59 | 0.79 | - | 0.83 |
| DR (%) | 96.54 | 63.27 | 78.65 | 95.68 | - | 97.30 |
| Accuracy (%) | 96.54 | 89.13 | 78.65 | 89.08 | 85.89 | 97.30 |
| 𝐹1 Score (%) | 97.26 | 90.85 | 78.65 | 90.61 | - | 97.65 |
| Precision (%) | 98.30 | 89.13 | 78.65 | 86.05 | - | 98.16 |
| FAR | - | - | 0.11 | - | 0.6 | 0.51 |
| Train-Time (s) | 47.22 | - | - | - | - | 205.27 |
| Test-Time (s) | 8.26 | - | - | - | - | 0.81 |
A summary comparison of ROGONG-IDS with advanced IDS methods in the
literature is provided in Figure 4.2.
Figure 4.2 Comparison of multiclass classification results with advanced methods
using the UNSW-NB15 dataset in the literature.
The experimental results show that ROGONG-IDS, which combines the two-method
imbalanced data handling module with XGBoost, significantly improves DR. In
Table 4.3, ROGONG-IDS is compared with other gradient boosting classifiers. The
XGBoost algorithm was found to be more successful than the other methods
(GBM, LightGBM). XGBoost matched or exceeded the DR of the other two classifiers
for the "Backdoor", "DoS", "Exploits", "Generic", "Normal", and "Worms" classes.
Overall, it provided better DR, ACC, and 𝐹1 score than the other two algorithms,
and its test time (0.81 s) was shorter than that of LightGBM (2.2 s), though
slightly longer than that of GBM (0.73 s).
As seen in Table 4.4, which compares ROGONG-IDS with other advanced IDS models
in the literature, ROGONG-IDS is the most successful of the compared models in
terms of DR, ACC, 𝐹1 score, and test time. ROGONG-IDS achieves the highest DR for
the "Analysis", "Fuzzers", "Generic", and "Normal" classes and matches the best
result for "Worms", but it is less successful than at least one other IDS model for
the "Backdoor", "DoS", "Exploits", "Reconnaissance", and "Shellcode" classes.
CHAPTER 5
5. CONCLUSION AND FUTURE WORK
Intrusion detection systems involve many challenges. The nature of
anomalies may be specific to each field, but in this domain malicious actors
continually create new anomalies and threats in complex ways. The UNSW-NB15
dataset, which includes modern attack types and offers many different network
parameters, was used so that this study reflects a current network environment.
However, more up-to-date datasets are needed to develop more robust IDS models in
this domain. Another difficulty is that attack packets in networks are less frequent
than regular packets. This causes a considerable data imbalance problem and
increases the size of the data to be used for modeling, which in turn increases the
computing power and time required to process the data.
In this thesis, the UNSW-NB15 dataset, which includes the most up-to-date and
modern scenarios and attack types in the literature, was used to test classifiers
based on gradient boosting. This evaluation determined that the XGBoost
algorithm is more successful than the other methods (GBM, LightGBM). The
ROGONG-IDS model was compared with five advanced IDS models in the literature
that were tested on the UNSW-NB15 dataset. ROGONG-IDS reached a DR, ACC, and
𝐹1 score of 97.30%, 97.30%, and 97.65%, respectively. These results indicate that
the ROGONG-IDS model is the most successful of these IDS models in the literature.
IDS studies differ in structure from general anomaly detection and
classification problems because of the size and volume of the data they encounter
and because they must handle streaming data. The proposed ROGONG-IDS model both
addresses the imbalanced data problem and runs quickly (205 s training, 0.81 s
test). Offering high detection performance quickly, ROGONG-IDS is an efficient
solution for real-time intrusion detection applications.
The ROGONG-IDS model could be used in areas that have a large data
imbalance in streaming data. Accordingly, the two-stage imbalanced data handling
module could achieve successful results in diverse areas such as smart production
lines, autonomous driving, social network analysis, fraud detection, and real-time
stock trading. As future work, it is planned to improve the detection of the attack
classes that ROGONG-IDS has difficulty detecting by using new reinforcement
methods, and to use Apache Spark, which is designed to process large-scale data, to
reduce the run time.
REFERENCES
Abdulhammed, R., Musafer, H., Alessa, A., Faezipour, M. and Abuzneid, A. (2019)
Features Dimensionality Reduction Approaches for Machine Learning Based
Network Intrusion Detection. Electronics 2019, 8, 322.
Al-Yaseen, W. L., Othman, Z. A., and Nazri, M. Z. A. (2017) Multi-level hybrid
support vector machine and extreme learning machine based on modified K-means
for intrusion detection system. Expert Syst. Appl. 67, C (January 2017),
296-303. DOI:https://doi.org/10.1016/j.eswa.2016.09.041
Andresini G., Appice, A., Mauro, N. D., Loglisci, C. and Malerba, D. (2020) Multi-
Channel Deep Feature Learning for Intrusion Detection, in IEEE Access, vol. 8,
pp. 53346-53359, 2020, doi: 10.1109/ACCESS.2020.2980937.
Axelsson, S. (2000) Intrusion Detection Systems: A Survey and Taxonomy. Technical
Report 99-15. Department of Computer Engineering, Chalmers University.
Belouch, M., El hadaj, S. and Idhammad, M. (2018) Performance evaluation of
intrusion detection based on machine learning using Apache Spark. Procedia
Computer Science. 127. 1-6. 10.1016/j.procs.2018.01.091.
Bergstra J., Yamins, D. and Cox, D.D. (2013) Making a Science of Model Search:
Hyperparameter Optimization in Hundreds of Dimensions for Vision
Architectures. To appear in Proc. of the 30th International Conference on
Machine Learning.
Chawla, N. V., Bowyer K. W., Hall, L. O. and Kegelmeyer, W. P. (2002) SMOTE:
Synthetic minority over-sampling technique, J. Artif. Intell. Res., 16 (1) ,
pp. 321-357
Chen, T. and Guestrin, C. (2016) XGBoost: A Scalable Tree Boosting System, in
Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD 2016, pp. 785-794, 2016,
https://arxiv.org/abs/1603.02754.
Chkirbene, Z., Eltanbouly, S., Bashendy, M., AlNaimi, N. and Erbad, A. (2020)
Hybrid Machine Learning for Network Anomaly Intrusion Detection, 2020
IEEE International Conference on Informatics, IoT, and Enabling Technologies
(ICIoT), 2020, pp. 163-170, doi: 10.1109/ICIoT48696.2020.9089575.
Elmasry, W., Akbulut, A., Zaim, A. H. (2020) Evolving deep learning architectures
for network intrusion detection using a double pso metaheuristic. Computer
Networks, 168, 107042. doi:10.1016/j.comnet.2019.107042
Guo, C., Ping, Y., Liu, N. and Luo, S. (2016) A two-level hybrid approach for intrusion
detection. Neurocomputing. 214. 10.1016/j.neucom.2016.06.021.
Gupta, N., Jindal, V. and Bedi, P. (2021) LIO-IDS: Handling class imbalance using
LSTM and improved one-vs-one technique in intrusion detection
system, Computer Networks, Volume 192, 2021, 108076, ISSN 1389-1286,
https://doi.org/10.1016/j.comnet.2021.108076.
(https://www.sciencedirect.com/science/article/pii/S1389128621001675)
Jiang, K., Wang, W., Wang, A. and Wu, H. (2020) Network Intrusion Detection
Combined Hybrid Sampling With Deep Hierarchical Network, in IEEE Access,
vol. 8, pp. 32464-32476, 2020, doi: 10.1109/ACCESS.2020.2973730.
Kabiri, P. and Ghorbani, A. (2005) Research on Intrusion Detection and Response: A
Survey. International Journal of Network Security. 1. 84-102.
Khan, F. A., Gumaei, A., Derhab, A. and Hussain, A. (2019) A Novel Two-Stage Deep
Learning Model for Efficient Network Intrusion Detection, in IEEE Access, vol.
7, pp. 30373-30385, 2019, doi: 10.1109/ACCESS.2019.2899721.
Kasongo, S. M. and Sun, Y. (2019) A Deep Learning Method With Filter Based
Feature Engineering for Wireless Intrusion Detection System, in IEEE Access,
vol. 7, pp. 38597-38607, 2019, doi: 10.1109/ACCESS.2019.2905633.
Kaja, N., Shaout, A. and Ma, D. (2019) An intelligent intrusion detection system.
Applied Intelligence. 49. 3235-3247. 10.1007/s10489-019-01436-1.
Kang, M-J., Kang, J-W. (2016) Intrusion Detection System Using Deep Neural
Network for In-Vehicle Network Security. PLOS ONE 11(6): e0155781.
https://doi.org/10.1371/journal.pone.0155781
Lee, B., Amaresh, S., Green, C. and Engels, D. (2018) Comparative Study of Deep
Learning Models for Network Intrusion Detection, SMU Data Science Review:
Vol. 1 : No. 1 , Article 8. Available at:
https://scholar.smu.edu/datasciencereview/vol1/iss1/8
Liu, J., Gao, Y. and Hu, F. (2021) A fast network intrusion detection system using
adaptive synthetic oversampling and LightGBM, Computers & Security,
Volume 106, 2021, 102289, ISSN 0167-4048,
https://doi.org/10.1016/j.cose.2021.102289.
Moustafa, N. and Slay, J. (2015) UNSW-NB15: a comprehensive data set for network
intrusion detection systems (UNSW-NB15 network data set), 2015 Military
Communications and Information Systems Conference (MilCIS), 2015, pp. 1-6,
doi: 10.1109/MilCIS.2015.7348942.
Moustafa, N. and Slay, J., (2017) A hybrid feature selection for network intrusion
detection systems: Central points. [online] arXiv.org. Available at:
<https://arxiv.org/abs/1707.05505> [Accessed 12 May 2021].
Mulyanto, M., Faisal, M., Prakosa S. W. and Leu, J-S. (2021) Effectiveness of Focal
Loss for Minority Classification in Network Intrusion Detection Systems.
Symmetry. 13(1):4. https://doi.org/10.3390/sym13010004
Naseer S., et al. (2018) Enhanced Network Anomaly Detection Based on Deep Neural
Networks, in IEEE Access, vol. 6, pp. 48231-48246, 2018, doi:
10.1109/ACCESS.2018.2863036.
Ogunleye, A. and Wang, Q-G. (2020) XGBoost Model for Chronic Kidney Disease
Diagnosis, in IEEE/ACM Transactions on Computational Biology and
Bioinformatics, vol. 17, no. 6, pp. 2131-2140, 1 Nov.-Dec. 2020, doi:
10.1109/TCBB.2019.2911071.
Shende, S. and Thorat, S. (2020) A Review on Deep Learning Method for Intrusion
Detection in Network Security, 2nd International Conference on Innovative
Mechanisms for Industry Applications (ICIMIA), 2020, pp. 173-177, doi:
10.1109/ICIMIA48430.2020.9074975.
Thaseen, S. I., Banu, J. S., Lavanya, K., Ghalib, R. M. and Abhishek, K. (2020) An
integrated intrusion detection system using correlation‐based attribute selection
and artificial neural network. Trans Emerging Tel Tech. 2021; 32:e4014.
https://doi.org/10.1002/ett.4014
Network-data Packet Analyzer [Tcpdump]. (2021) Retrieved from
https://www.tcpdump.org/
Uddin, M., Abdul Rahman, A., Uddin, N., Memon, J. and Kazi, S. (2013). Signature-
based Multi-Layer Distributed Intrusion Detection System using Mobile Agents.
International Journal of Network Security. 15. 79-87.
Vigneswaran, R. K., Vinayakumar, R., Soman, K. P. and Poornachandran, P. (2018)
Evaluating Shallow and Deep Neural Networks for Network Intrusion Detection
Systems in Cyber Security, 2018 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT), 2018, pp. 1-6, doi:
10.1109/ICCCNT.2018.8494096.
Vinayakumar, R., Alazab, M., Soman K. P., Poornachandran, P., Al-Nemrat, A. and
Venkatraman, S. (2019) Deep Learning Approach for Intelligent Intrusion Detection
System, in IEEE Access, vol. 7, pp. 41525-41550, 2019, doi:
10.1109/ACCESS.2019.2895334.
Xu, C., Shen, J., Du, X., and Zhang, F. (2018) An Intrusion Detection System Using a
Deep Neural Network With Gated Recurrent Units, in IEEE Access, vol. 6, pp.
48697-48707, 2018, doi: 10.1109/ACCESS.2018.2867564.
Yang, Y., Zheng, K., Wu, C. and Yang, Y. (2019) Improving the Classification
Effectiveness of Intrusion Detection by Using Improved Conditional Variational
AutoEncoder and Deep Neural Network. Sensors. 19. 2528. 10.3390/s19112528.
Yang, Y., Zheng, K., Wu, B., Yang, Y. and Wang, X. (2020) Network Intrusion
Detection Based on Supervised Adversarial Variational Auto-Encoder With
Regularization, in IEEE Access, vol. 8, pp. 42169-42184, 2020, doi:
10.1109/ACCESS.2020.2977007.
Yin, C., Zhu, Y., Fei, J. and He, X. (2017) A Deep Learning Approach for Intrusion
Detection Using Recurrent Neural Networks, in IEEE Access, vol. 5, pp. 21954-
21961, 2017, doi: 10.1109/ACCESS.2017.2762418.
Zhang, H., Wu, C. Q., Gao, S., Wang, Z., Xu, Y. and Liu, Y. (2018) An Effective Deep
Learning Based Scheme for Network Intrusion Detection, 2018 24th
International Conference on Pattern Recognition (ICPR), 2018, pp. 682-687,
doi: 10.1109/ICPR.2018.8546162.
Zhang, J. and Mani, I. (2003) KNN Approach to Unbalanced Data Distributions: A
Case Study Involving Information Extraction. Proceeding of International
Conference on Machine Learning (ICML 2003), Workshop on Learning from
Imbalanced Data Sets, Washington DC, 21 August 2003.
Zhang, H., Huang, L., Wu, C. Q. and Li, Z. (2020) An effective convolutional neural
network based on SMOTE and Gaussian mixture model for intrusion detection
in imbalanced dataset, Computer Networks, Volume 177, 2020, 107315, ISSN
1389-1286, https://doi.org/10.1016/j.comnet.2020.107315.
Zhiqiang, L., Mohi-Ud-Din G., Bing, L., Jianchao, L., Ye, Z. and Zhijun, L. (2019)
Modeling Network Intrusion Detection System Using Feed-Forward Neural
Network Using UNSW-NB15 Dataset, 2019 IEEE 7th International Conference
on Smart Energy Grid Engineering (SEGE), 2019, pp. 299-303, doi:
10.1109/SEGE.2019.8859773.
Zhong, M., Yajin, Z. and Chen, G. (2021) Sequential Model Based Intrusion Detection
System for IoT Servers Using Deep Learning Methods. Sensors. 2021;
21(4):1113. https://doi.org/10.3390/s21041113
APPENDIX
APPENDIX A) SOURCE CODE
The Python source code for the 10 sampling methods and 3 classifiers used in this
thesis is available at https://github.com/aokanarik/ROGONG-IDS.
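As a rough illustration of the two resampling ideas the model relies on, the sketch below implements a simplified SMOTE-style oversampler and a NearMiss-1-style undersampler using only the Python standard library. It is not the code from the repository: the function names (`smote_sketch`, `near_miss_1_sketch`), the toy two-dimensional data, and the parameter values are all assumptions made for this example, and it assumes at least two minority samples.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points (simplified SMOTE idea).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclid(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def near_miss_1_sketch(majority, minority, n_keep, k=3):
    """Keep n_keep majority points (simplified NearMiss-1 idea).

    A majority point is scored by its mean distance to its k closest
    minority points; the points with the smallest scores are kept.
    """
    def score(p):
        closest = sorted(euclid(p, m) for m in minority)[:k]
        return sum(closest) / len(closest)

    return sorted(majority, key=score)[:n_keep]


# Hypothetical toy data: 4 minority points and 25 majority points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), float(j)) for i in range(3, 8) for j in range(3, 8)]

# Oversample the minority class to 10 points, undersample the majority to 10.
balanced_minority = minority + smote_sketch(minority, n_new=6)
kept_majority = near_miss_1_sketch(majority, minority, n_keep=10)
```

On real intrusion data the resampling would be applied to the training split only, so that the test set keeps its original class distribution.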
RESUME
He graduated from the Management Information Systems program at Işık University
in 2019. He has been working as a research assistant in the Department of Management
Information Systems at Yeditepe University since 2020. Before taking this position, he
worked as a data scientist in the research and development center of a consulting firm
in Istanbul. His research interests are artificial intelligence, machine learning, and deep
learning.