All content in this area was uploaded by Mansi Bhavsar on Mar 10, 2024
Intrusion-based Attack Detection Using Machine
Learning Techniques for Connected Autonomous Vehicle
Mansi Bhavsar1, Kaushik Roy2, Zhipeng Liu3, John Kelly4, Balakrishna Gokaraju5
1 North Carolina A&T State University, Greensboro, NC, USA
mhbhavsar@aggies.ncat.edu
2 North Carolina A&T State University, Greensboro, NC, USA
kroy@ncat.edu
3 North Carolina A&T State University, Greensboro, NC, USA
zilu2@aggies.ncat.edu
4 North Carolina A&T State University, Greensboro, NC, USA
jck@ncat.edu
5 North Carolina A&T State University, Greensboro, NC, USA
bgokalraju@ncat.edu
Abstract. With advancements in technology, ensuring the security of self-driving cars has become
an important issue. Hackers have been developing increasingly complex and harmful cyberattacks
that are difficult to detect. Furthermore, due to the diversity of the data exchanged among these
vehicles, traditional algorithms have difficulty detecting such threats. Therefore, a network
intrusion detection system (IDS) is essential in a connected autonomous vehicle's communication
infrastructure. The IDS aims to secure the network by identifying malicious and abnormal traffic
in real time. This paper focuses on the data preprocessing, feature extraction, and attack
detection for such a system.
Additionally, it compares the performance of the proposed IDS across different machine learning
models. We apply Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors
(KNN), Classification and Regression Tree (CART), and Support Vector Machine (SVM) to classify
the NSL-KDD dataset. Binary and multiclass classification were applied to both the train and test
files. The KNN and CART algorithms reached about 98% and 94% accuracy on the train and test
files, respectively.
Keywords: Machine learning, Autonomous vehicle, cyberattacks, intrusion, data preprocessing,
feature engineering, ML model, accuracy.
1 Introduction
Advances in machine learning (ML) and deep neural networks (DNN) bring colossal potential to
make self-driving cars a reality. Technology advancements on both the software and hardware
sides open new doors for applications in different domains. Many organizations (such as Ford,
Toyota, NVIDIA, NCA&T, and many more) are in the race to develop safe and secure autonomous
cars. Because these vehicles rely on a communication system, they expose a larger threat surface
that malicious hackers can use to exploit system vulnerabilities. Connected autonomous vehicles
(CAV) are a transformative technology with growing potential as a research area. They help
reduce traffic congestion and accidents, improve efficiency, and improve the quality of
vehicular systems. Moreover, technologies such as ML, big data, IoT, and the sharing economy
extensively benefit intelligent cities [1].
Autonomous cars are vulnerable to different security threats. Network security is an important
topic, and an intrusion detection system (IDS) can help mitigate network threats without
disrupting the safety and security of the host and the network. IDSs are increasingly
implemented with ML techniques. An IDS may be classified as host-based or network-based,
depending on its placement in the network [2]. There are two types of detection: misuse
detection and anomaly detection. Misuse detection identifies known attacks, whereas anomaly
detection identifies abnormal behavior. ML models are used to build anomaly-based detection
systems and also assist in feature engineering. This paper uses an existing labeled dataset
(NSL-KDD) [3] to evaluate an anomaly-based IDS that mitigates threats and attacks. This dataset
has allowed researchers to compare different IDS methods and to build IDSs, either host-based or
network-based. We apply different ML algorithms and present a comparative analysis.
The rest of the paper is organized as follows. Section 2 covers the related work. Section 3
describes the methodology, including the data description, data preprocessing, and feature
engineering. Section 4 presents the results and performance metrics. Finally, Section 5
concludes the paper, and Section 6 outlines the future scope.
2 Literature Review
Cyber threats have become an essential issue with the emergence of self-driving vehicles, which
require systems that keep connected vehicles safe and secure. The NSL-KDD dataset [3] is a
refined version of the KDDcup99 dataset [4]. Many analyses have been carried out, applying
different techniques and tools to develop an effective intrusion detection system. A detailed
implementation of various machine learning techniques with the WEKA tool is presented in [5],
and a detailed description of the dataset is given in [6].
Redundant records bias the learning process, which might be one reason why a specific
classifier shows an accuracy above 95% [7,8]. The results in [9] show that machine learning
algorithms do not produce good results for misuse detection. In [10], the authors compared
supervised ML classifiers for network intrusion detection. The most efficient algorithm was
selected using performance metrics, and the authors concluded that Random Forest performs better
than the other classifiers. The authors of [11] proposed a lightweight IDS method that focuses
on data preprocessing so that only essential features are used. They removed the redundant data
from the dataset, which helped them get unbiased results with machine learning models. Different
feature selection techniques, such as wrapper or embedded feature selection, have been used to
improve the results [12]. Correlation-based feature selection filter methods have also been
used, and they verify the model's effectiveness in terms of the detection rate while keeping a
low false-positive rate on a full attack scenario [13]. The IDS in [14] uses supervised machine
learning models to distinguish normal from abnormal traffic. The proposed method only classifies
Denial of Service (DoS) and probe attacks; the remaining attack types are not considered.
The authors of [15] proposed anomaly-based network intrusion detection using an improved
self-adaptive Bayesian algorithm to process a large amount of data. In [16], the authors proposed
an intrusion detection method based on a support vector machine and used a feature removal method
to improve the efficiency of the algorithm.
3 Methodology
3.1 Dataset Used
The project used the NSL-KDD [3] dataset with 42 attributes. It is an improved version of the
KDD99 dataset [17], a standard dataset for intrusion detection. Several versions of the dataset
are available, from which KDDTrain+ and KDDTest+ (training and testing data, respectively) were
considered. They contain 125973 and 22544 instances, respectively.
The dataset contains network attacks related to the autonomous vehicle, covering 24 attack types
in the train file and 14 additional attack types in the test file. The two files, KDD_train.csv
and KDD_test.csv, are therefore not drawn from the same probability distribution, which makes the
task more realistic. Moreover, some intrusion experts believe that most novel attacks are
variants of known attacks, so that the signatures of known attacks can be sufficient to catch
novel variants.
This is a classification problem. The dataset is described in Figure 1, and Table 1 gives the
total instances of both files together with the number of instances per class.
Table 1 Detailed instances in the dataset
NSL-KDD     Total instances   Normal   DoS     Probe   R2L    U2R
KDD_train   125973            67343    45927   11656   995    52
KDD_test    22544             9711     7460    2885    2421   67
Normal traffic has no attack recorded, and the other four classes contain the documented
attacks, including Distributed Denial of Service (DDoS) attacks. A DDoS attack is a malicious
attempt to disrupt the traffic of a targeted server [18]. The four attack classes are described
below:
➢ DoS (denial of service): the attacker overloads the server with more requests than it can
handle.
➢ Probe: the attacker scans the network to find and misuse a known vulnerability.
➢ R2L (remote to local): the attacker tries to gain unauthorized local access by sending
packets to the victim machine.
➢ U2R (user to root): the attacker uses a standard user account to exploit system
vulnerabilities and gain root access.
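To make the class distribution in Table 1 concrete, the following Python sketch computes the share of each class in both files. The counts are copied from Table 1; the helper name class_shares is ours and purely illustrative.

```python
# Instance counts per class, copied from Table 1.
train_counts = {"Normal": 67343, "DoS": 45927, "Probe": 11656, "R2L": 995, "U2R": 52}
test_counts = {"Normal": 9711, "DoS": 7460, "Probe": 2885, "R2L": 2421, "U2R": 67}


def class_shares(counts):
    """Return the fraction of all instances that falls into each class."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)

for name in train_counts:
    print(f"{name:<7} train={train_shares[name]:.4%}  test={test_shares[name]:.4%}")
```

On the train file, U2R accounts for roughly 0.04% of the records (52 of 125973), which illustrates how strongly imbalanced the multiclass problem is.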
3.2 Data Preprocessing
Preprocessing is vital before applying ML techniques to any dataset, so that less essential
attributes do not affect the accuracy of the chosen classifier. This section describes the
complete preprocessing steps for the two files of the NSL-KDD [3] dataset. The preprocessing was
done in Python using a Jupyter notebook (Anaconda Navigator). The same methods as in [19] were
used to preprocess both files.
Preprocessing contains:
• Load the dataset and analyze its statistics.
• Change the sub-attack labels to their respective classes.
• Check for missing values in the dataset.
• Check for outliers.
After performing the above steps, the dataset contains no missing values, but it does contain
outliers, as the paired box plots in Figure 1 show for two different ranges (a and b). The IQR
(interquartile range) method was therefore used to remove outliers, and the box and whisker plot
in Figure 2 shows the attributes after outlier removal. However, the cleaned data has shape
(40118, 42), which means the dataset loses more than half of its information. This is not
acceptable, because a model trained on the cleaned dataset overfits (checked by applying the
spot-check algorithms, which reached 0.99 accuracy).
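The IQR rule used for outlier removal can be sketched as follows. This is a minimal illustration in plain Python, not the notebook code used for the paper: the 1.5 multiplier is the conventional choice and is an assumption here, and the function names and the toy column are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule; k=1.5 is the conventional multiplier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outlier_rows(rows, columns, k=1.5):
    """Keep only the rows whose value lies inside the fences for every listed column."""
    bounds = {c: iqr_bounds([row[c] for row in rows], k) for c in columns}
    return [
        row for row in rows
        if all(bounds[c][0] <= row[c] <= bounds[c][1] for c in columns)
    ]


# Toy column (hypothetical values): most records are 0 and a few are much larger.
rows = [{"duration": d} for d in [0, 0, 0, 0, 0, 0, 0, 0, 5, 1200]]
kept = remove_outlier_rows(rows, ["duration"])
print(len(rows), "rows before,", len(kept), "rows after")
```

When a column is dominated by a single value, the quartiles collapse onto that value and every other record falls outside the fences. Applied across many such columns, this kind of filtering can discard a large share of the rows.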
Figure 1 Paired box plot: (a) range 0-25, (b) range 0-1200
Figure 2 Box and whisker plot
The histograms (shown in Figure 3) prompted a closer look at the data itself. Many attributes
(such as duration, host_srv_rate, serror_rate, …) have most of their values concentrated in one
bin, while the remaining values lie far away from it. Because this gap is so large, the IQR
method treats the remaining values as outliers, which means it drops valuable records from the
dataset. It is therefore not worth removing the outliers, and we keep the original data for our
use case.
Figure 3 Histogram plot
Preprocessing steps, continued:
• Converted the attack labels into five classes for multiclass classification, for both
training and testing.
• For binary classification, changed the attack labels into two categories, normal and abnormal
(attack), with the help of a label encoder.
• For multiclass classification, changed the attack labels into five categories (Normal, R2L,
Probe, U2R, and DoS) with the help of a label encoder.
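The label conversion steps above can be sketched in plain Python. The mapping below lists only a subset of the attack names found in the KDD family of datasets and is meant as an illustration. The dictionary, the function names, and the choice of 0 for normal and 1 for attack are assumptions, and the integer encoding simply imitates the alphabetical ordering that a label encoder such as scikit-learn's LabelEncoder would produce.

```python
# Illustrative subset of sub-attack labels and the class each belongs to.
ATTACK_CLASS = {
    "normal": "normal",
    "neptune": "DoS", "smurf": "DoS", "back": "DoS", "teardrop": "DoS",
    "satan": "Probe", "ipsweep": "Probe", "portsweep": "Probe", "nmap": "Probe",
    "guess_passwd": "R2L", "warezclient": "R2L", "ftp_write": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R",
}


def to_multiclass(labels):
    """Map each sub-attack label to one of the five classes."""
    return [ATTACK_CLASS[label] for label in labels]


def to_binary(labels):
    """Map each label to 0 (normal) or 1 (any attack)."""
    return [0 if label == "normal" else 1 for label in labels]


def fit_encoder(classes):
    """Assign integer codes in sorted order, as a label encoder would."""
    return {name: code for code, name in enumerate(sorted(set(classes)))}


raw = ["normal", "neptune", "nmap", "guess_passwd", "rootkit", "smurf"]
classes = to_multiclass(raw)
encoder = fit_encoder(classes)
encoded = [encoder[c] for c in classes]
binary = to_binary(raw)
print(encoder)
print(encoded, binary)
```

With these spellings, the sorted order assigns code 3 to U2R, which is consistent with the later reference to U2R as class 3; the exact encoding used in the experiments is not stated, so this is only a plausible reading.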
3.3 Feature Engineering
Feature engineering is a crucial step to improve the performance of ML techniques. Feature
selection is performed using the Pearson correlation method [19], a standard method for
filtering the essential features from a dataset. According to [20], a correlation coefficient
below 0.2 is considered negligible. We selected a threshold of 0.5 to extract the features with
moderate to high correlation. Using the correlation matrix, we keep the attributes whose
correlation with the encoded attack label is greater than 0.5, for both binary and multiclass
classification.
Figure 4 shows the features whose correlation is greater than 0.5, which were selected for
binary and multiclass classification. The same procedure was followed for both the train and
test files.
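A correlation filter of this kind can be sketched as below. The code is a minimal illustration with hypothetical toy values, not the pipeline used for the paper. It compares the absolute value of the Pearson coefficient with the threshold, which is an assumption on our part, since the text only states "greater than 0.5".

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, label, threshold=0.5):
    """Return {feature: r} for features whose |r| with the label exceeds the threshold."""
    selected = {}
    for name, values in columns.items():
        r = pearson(values, label)
        if abs(r) > threshold:
            selected[name] = r
    return selected


# Toy data: encoded binary label and three candidate attributes (values are made up).
label = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "serror_rate": [0.0, 0.1, 0.0, 0.05, 0.9, 1.0, 0.95, 1.0],
    "duration": [1, 5, 2, 7, 3, 6, 2, 5],
    "constant_flag": [0, 0, 0, 0, 0, 0, 0, 0],
}
selected = select_features(columns, label)
print(selected)
```

In this toy example only serror_rate passes the filter: duration is almost uncorrelated with the label, and the constant column carries no information.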
Figure 4 Feature selection with the correlation method for train and test file
4 Results and Discussion
We applied logistic regression (LR), support vector machine (SVM), K-nearest neighbor (KNN),
classification and regression tree (CART), and linear discriminant analysis (LDA) to the
modified version of the NSL-KDD dataset. These are standard machine learning models, each with
its own advantages and disadvantages. Each algorithm works efficiently depending on the user's
use case and data type. The accuracy of the models is shown in Tables 2 and 3, and the model
comparison in Figures 5 and 6, for the train and test files, respectively.
Table 2 Train accuracy (%) for binary and multiclass classification
Model   Binary   Multiclass
LR      96.95    94.43
LDA     96.68    93.14
KNN     98.52    98.25
CART    98.45    98.25
SVM     97.24    95.91
Figure 5 Train data box plot for model comparison of binary and multiclass classification
Figure 6 Test data box plot for model accuracy of binary and multiclass classification
Table 2 gives the train results for binary and multiclass classification. On this
classification problem, the CART and KNN algorithms give higher accuracy than LR and LDA. The
box plots in Figures 5 and 6 compare the accuracy of the algorithms for binary and multiclass
classification. For multiclass classification on the test file, the accuracies of the LR and SVM
algorithms were increased by tuning their parameters. For LDA, the tolerance and different
solvers from the scikit-learn documentation were tried, but they did not improve the accuracy.
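A spot-check comparison of the five classifiers can be sketched with scikit-learn as follows. The sketch uses 10-fold cross-validation, which the Conclusion names as the testing procedure. It runs on synthetic data from make_classification rather than NSL-KDD, and the hyperparameters, the random seeds, and the max_iter value are assumptions chosen only to keep the example small and reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected features and the encoded binary label.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=7
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=7),
    "SVM": SVC(),
}

# 10 folds: each fold is used once for testing and nine times for training.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The per-fold scores in results are the kind of data from which a box plot per model can be drawn. KNN and SVM are sensitive to feature scale, so on real data a scaling step would normally be added in front of them.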
Table 3 Test accuracy (%) for binary and multiclass classification
Model   Binary   Multiclass
LR      92.45    90.16
LDA     92.45    87.76
KNN     94.98    94.84
CART    94.47    95.78
SVM     92.00    91.02
Table 3 gives the test results for binary and multiclass classification, which are acceptable.
As on the train file, the CART and KNN algorithms give higher accuracy than LR and LDA.
Therefore, it can be said that the model predicts the correct attack classes with about 94%
overall accuracy using the CART and KNN algorithms.
Performance metrics check the model's performance and behavior. They describe whether the model
performs well when predicting on the test data. Several performance metrics were checked, and
the results for binary and multiclass classification on the test and train files are given in
Tables 4 and 5, respectively.
Table 4 Performance metrics for binary classification
Metric             KDD_test   KDD_train
Precision          0.875      0.967
Recall             0.942      0.973
Accuracy           0.917      0.967
False alarm rate   0.05803    0.0270
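The four metrics reported for binary classification can be computed from the confusion counts, as the sketch below shows. The paper does not spell out its formulas, so the usual definitions are assumed here, in particular false alarm rate = FP / (FP + TN). The function name and the toy labels are hypothetical, with 1 standing for attack and 0 for normal.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, accuracy, and false alarm rate for labels 1=attack, 0=normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(y_true),
        # Share of normal traffic that is wrongly flagged as an attack.
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy example: 4 attack records and 6 normal records (values are made up).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For multiclass classification the same quantities are usually computed per class and then averaged; which averaging scheme the paper used is not stated.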
Table 5 Performance metrics for multiclass classification
Metric             KDD_test   KDD_train
Precision          0.898      0.952
Recall             0.900      0.961
Accuracy           0.900      0.961
False alarm rate   0.1572     0.0933
It is evident from Tables 4 and 5 that, for binary classification, our model achieves a high
accuracy of 91.7% on the test file and 96.7% on the train file, and a recall of 94.2% on the
test file and 97.3% on the train file. CART, KNN, and SVM perform better, as they are commonly
preferred classifiers for classification problems. Each classifier achieves high precision,
which is consistent with its relatively low false alarm rate. The results follow a similar
pattern for multiclass classification.
5 Conclusion
This paper briefly explains, step by step, how ML algorithms are applied to the CAV dataset.
The procedure covers data preprocessing, feature extraction, and machine learning classifiers.
The trained classifiers are then used to predict intrusion-based attacks on a self-driving car.
Finally, the results were studied for predicting DDoS attacks with binary and multiclass
classification. The accuracy of the models is comparable for multiclass and binary
classification (approximately 94 to 98%).
Furthermore, feature selection helps to reduce the training and testing times. Testing is
performed with 10-fold cross-validation, where each fold is used once for testing and nine times
for training. The results show that the proposed preprocessing and feature selection method
delivers high accuracy.
6 Future Scope
The model results are reasonable for both cases. However, a gap remains in the number of
instances per attack class. Looking back at the label counts per class, the dataset is
imbalanced for the U2R attack (class 3, with 52 instances), which is far fewer than the other
classes. This may be one of the reasons why the accuracy varies, or drops, in multiclass
classification compared with binary classification. Furthermore, removing the third class (U2R),
which has so few instances, may show only a limited difference. In addition, because the records
do not differ strongly from one another, removing the third class by itself may not improve the
results by a meaningful amount. Moreover, given the mechanisms of attacks on IoT, external
features also need to be evaluated and considered. A data spike can be detected with a simple
anomaly technique when future attacks occur. Thus, we believe that our model is capable of
detecting such intrusions. In the future, we will apply additional data techniques (such as
interpolation) to check whether they help the prediction model or reduce data redundancy and
vulnerability to novel attacks. We will also consider various outlier handling techniques.
Acknowledgment
This research is supported by Palo Alto Networks. Any opinions, findings, conclusions,
or recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of Palo Alto Networks.
References
1. Bimbraw K: Autonomous Cars: Past, Present, and Future. A Review of the Developments in the Last Century, the Present Scenario, and the Expected Future of Autonomous Vehicle Technology, In: Proceedings of the 12th International Conference on Informatics in Control, Automation and Robotics, pages 191-198, ICINCO, (2015).
2. Ieracitano C, Adeel A, Morabito F C, Hussain A, A novel statistical analysis and autoencoder driven intelligent intrusion detection approach, Neurocomputing, 387, Doi: https://doi.org/10.1016/j.neucom.2019.11.016, 51–62, (2020).
3. M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, A Detailed Analysis of the KDD CUP 99 Data Set, Submitted to Second IEEE Symposium on Computational Intelligence for Security and Defense Applications, CISDA, (2009).
4. Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorbani, A Detailed Analysis of the KDD CUP 99 Data Set, In: Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Security and Defense Applications, CISDA, (2009).
5. S. Revathi, Dr. A. Malathi, A Detailed Analysis on NSL-KDD Dataset Using Various Machine Learning Techniques for Intrusion Detection, International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 2, Issue 12, (December 2013).
6. L. Dhanabal, Dr. S.P. Shantharajah, A Study on NSL-KDD Dataset for Intrusion Detection System Based on Classification Algorithms, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 4, Issue 6, (June 2015).
7. S. Revathi, Dr. A. Malathi, A detailed analysis of KDD cup99 Dataset for IDS, International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 12, (December 2013).
8. R. P. Lippmann, D. J. Fried, and I. Graf, Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation, In: Proceedings of the 2000 DARPA Information Survivability Conference and Exposition, DISCEX’00, (2000).
9. Maheshkumar Sabhnani, Gursel Serpen, Why Machine Learning Algorithms Fail in Misuse Detection on KDD Intrusion Detection Dataset, ACM Transactions on Intelligent Data Analysis, pp. 403-415, (2004).
10. Y. Hamid, M. Sugumaran, and V. R. Balasaraswathi, IDS Using Machine Learning - Current State of Art and Future Directions, Current Journal of Applied Science and Technology, 15, 3, 1-22, doi:10.9734/BJAST/2016/23668, (Mar 2016).
11. J. Manjula C. Belavagi and Balachandra Muniyal, Performance Evaluation of Supervised Machine Learning Algorithms for Intrusion Detection, 25th International Multi-Conference on Information Processing, (2016).
12. Yasmen Wahba, Ehab Elsalamouny, Ghada Eltaweel, Improving the performance of multiclass intrusion detection systems using feature reduction, Research gate, article, (June 2015).
13. Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani, Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization, 4th International Conference on Information Systems Security and Privacy, ICISSP, Portugal, (January 2018).
14. P. Sangkatsanee, N. Wattanapongsakorn, C. Charnsripinyo, Practical Real-Time Intrusion Detection Using Machine Learning Approaches, Computer Communications, Vol. 34, no. 18, pp. 2227–2235, (2011).
15. D. M. Farid and M. Z. Rahman, Anomaly network intrusion detection based on improved self-adaptive Bayesian algorithm, Journal of Computers, vol. 5, no. 1, pp. 23–31, (2010).
16. Y. Li, J. Xia, S. Zhang, J. Yan, X. Ai, and K. Dai, An Efficient Intrusion Detection System Based on Support Vector Machines and Gradually Feature Removal Method, Expert Systems with Applications, vol. 39, no. 1, pp. 424–430, (2012).
17. The NSL-KDD dataset from the Canadian Institute for Cybersecurity (an updated version of the original KDD Cup 1999 Data (KDD99)), https://www.unb.ca/cic/datasets/nsl.html
18. Iman Sharafaldin, Arash Habibi Lashkari, Saqib Hakak, Ali A. Ghorbani, Developing Realistic Distributed Denial of Service (DDoS) Attack Dataset and Taxonomy, IEEE 53rd International Carnahan Conference on Security Technology, Chennai, India, (2019).
19. https://github.com/abhinav-bhardwaj/Network-Intrusion-Detection-Using-Machine-Learning/blob/master/README.md
20. M. M. Mukaka, Statistics Corner: A guide to the appropriate use of Correlation coefficient in medical research, Malawi Med. J., vol. 24, no. September, pp. 69–71, (2012).