A Review on Cyber Security Datasets for Machine Learning Algorithms
Ozlem Yavanoglu
Hacettepe University
Department of Computer Engineering
Ankara, Turkey
milletseverozlem@gmail.com

Murat Aydos
Hacettepe University
Department of Computer Engineering
Ankara, Turkey
maydos@hacettepe.edu.tr
Abstract: Information is undeniably one of the most significant
assets of every company and organization. Protecting it is
therefore crucial, and security models driven by real datasets have
become important. Military, government, commercial and civilian
operations all depend on the security and availability of computer
systems and networks. Network security in particular is a
significant issue because the scale of attacks keeps rising over
the years and attacks are becoming more sophisticated and
distributed. The objective of this review is to explain and compare
the most commonly used datasets. The paper focuses on the datasets
used with artificial intelligence and machine learning techniques,
which are the primary tools for analyzing network traffic and
detecting abnormalities.
Keywords: Cyber Security, Data Mining, Artificial Intelligence,
Machine Learning, Benchmarking.
I. INTRODUCTION
Cyber security is the practice of applying security measures to
provide confidentiality, integrity, and availability of data [1].
Numerous definitions of cyber security appear in the literature.
According to Canongia and Mandarino, it is “the art of ensuring the
existence and continuity of the information society of a nation,
guaranteeing and protecting, in Cyberspace, its information, assets
and critical infrastructure” [2]. Cyber security is a significant
research area because government, military, commercial, financial
and civilian operations all gather, process, and store tremendous
volumes of data on computers and other devices [1-3]. To stay on
the defensive side of cyber security, a company has to coordinate
its efforts across its whole information system. The components of
cyber security include network security, application security,
mobile security, data security, endpoint security and so on [3].
Over the last few years, the use of the Internet and computer
applications has expanded immensely, and they have become an
integral part of everyday life. With the exponential increase in
the use of computer applications and computer networks, security is
becoming increasingly significant [4-5]. Attackers can potentially
use several paths through an application to wreak havoc on a
business or organization. Figure 1 illustrates some potential
attacks and threats to organizations.
Each of these paths represents a risk that may or may not be
serious enough to warrant attention [4-6]. According to the
National Institute of Standards and Technology (NIST), American
companies as early as 2017 suffered losses of up to 65.6 billion
dollars following IT attacks [6].
Fig. 1. Unidentified activities as potential attacks and threats to
an organization [2]
The huge increase in the number of cyber-attacks has made
artificial intelligence and machine learning based methodologies a
crucial part of detecting security threats. Security-related
benchmarks are very important for ensuring that the best security
applications are adopted and that an appropriate level of security
is obtained. They are essential for several types of cyber security
research, such as intrusion detection systems. While there have
been a few studies of particular datasets, fewer have examined the
overall state of security-related datasets. The literature contains
numerous studies in the field of cyber security that use various
datasets [7-11]. In this study, a comprehensive review of the
current publicly available datasets is given. We also provide a
general assessment of the artificial intelligence and machine
learning techniques that use these datasets.
The rest of the paper is organized as follows: Section II explains
the essential security concepts. Section III presents a summary of
previous studies. Section IV describes machine learning and
artificial intelligence techniques. Section V discusses the major
datasets and their characteristics. Finally, Section VI presents
observations and concluding remarks.
II. BASIC SECURITY CONCEPTS
Cyber security is the practice of applying security measures to
provide confidentiality, integrity, and availability of data. In
this section, we explain the well-known triad of confidentiality,
integrity, and availability (CIA) of information security [9-10].
Confidentiality aims to restrict disclosure and to grant access to
information only to authorized people. Through confidentiality,
companies protect their sensitive and private assets from
unauthorized hands. Confidentiality can be ensured in various ways,
such as encryption, access controls, and steganography [11].
Integrity requires keeping data consistent, accurate, and reliable.
It has to guarantee that data is not altered over a specific
period. The right processes and actions have to be in place to
prevent unauthorized users from making modifications. Hashing,
digital signatures, certificates and non-repudiation are the tools
and mechanisms that provide integrity [10-11].
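As a minimal sketch of how hashing supports integrity, the following Python example stores a SHA-256 digest of a record and later checks whether the record still matches it. The record contents and the helper names (`sha256_digest`, `is_unaltered`) are illustrative assumptions, not part of any cited work.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, expected_digest: str) -> bool:
    """Check data against a previously stored digest (constant-time compare)."""
    return hmac.compare_digest(sha256_digest(data), expected_digest)


# Hypothetical record; the digest is computed while the data is known to be good.
original = b"account=42;balance=100"
stored_digest = sha256_digest(original)

# Any later modification changes the digest, so the check fails.
tampered = b"account=42;balance=900"

print(is_unaltered(original, stored_digest))
print(is_unaltered(tampered, stored_digest))
```

A plain digest only helps when the stored digest itself is out of the attacker's reach; digital signatures and certificates, also listed above, address that case.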
Availability means that data and resources should be accessible
when people need them, particularly during emergencies or
disasters. Cyber security specialists have to handle three common
challenges to availability: denial of service (DoS), loss of
information system capabilities because of natural disasters, and
equipment failures during normal operation [11].
III. RELATED STUDIES
The literature contains a considerable number of studies on the
problem of cyber security, and various approaches are widespread in
general cyber security solutions. In this section, we focus on
machine learning and artificial intelligence approaches to cyber
security issues.
Chowdhury et al. proposed a new botnet detection method based on
the topological features of nodes within a graph. The method can
detect anomalies by searching a limited number of nodes. It is
based on self-organizing map (SOM) clustering, which belongs to the
class of unsupervised systems. The study used the CTU-13 dataset,
the largest dataset that contains bot-labeled nodes, and used
another detection algorithm, the support vector machine (SVM), for
comparison. Experimental results show that the proposed method can
still detect bots with acceptable accuracy by searching only a few
nodes [12].
Huseynov et al. proposed a bio-inspired computing technique, ant
colony clustering, for the detection of botnet attacks. The
proposed model can find botnet hosts quickly and precisely without
depending on the traffic payload. The approach was tested with two
different clustering algorithms, ATTA-C and K-means, for
comparison. The ISOT dataset was chosen because of its volume [13].
Neethu B. presents a framework that uses PCA for feature selection
together with Naive Bayes to build a network intrusion detection
system. The KDDCup 1999 intrusion detection benchmark dataset is
used for the experiments. The results show that this method
achieves a higher detection rate, takes less time and has a lower
cost factor than neural network and tree algorithm based
approaches. In addition, the proposed system provides about 94%
accuracy [14].
Rafal et al. presented a novel method for detecting cyber-attacks
that target web applications. The method was compared with the
machine learning algorithms Naive Bayes, AdaBoost, Part and J48,
and the CSIC 2010 HTTP Dataset was used to assess the proposed
model. The study focuses specifically on solutions that use the
HTTP protocol for communication between clients and servers. The
authors claim that the model obtains a higher detection percentage
with a lower false positive rate. The results also show that J48 is
the best approach for this problem and that the true-positive value
is around 0.04 [15].
Nguyen and Franke proposed an adaptive intrusion detection system
(A-IDS). The system can detect many different types of attack in
heterogeneous and adversarial network environments. The authors
conducted experiments on two different datasets for benchmarking
Web Application Firewalls: the ECML-PKDD 2007 HTTP dataset and the
CSIC HTTP 2010 dataset. The Naïve Bayes, Bayes network, decision
stump and RBF network machine learning algorithms are used for
comparison with the proposed method. The experimental results show
that, on the CSIC 2010 dataset, it provides almost 10% and 8%
higher accuracy than the best IDS, which is the Bayes network-based
IDS, and the Hedge/Boosting algorithm, respectively [16].
Xie et al. focused on detecting anomalies with a short sequence
model. The study proposes a novel anomaly detection system using a
Support Vector Machine (SVM), and ADFA-LD is used for the
experiments. The values k = 3, 5, 8 and 10 were tested, and the
best result was obtained with k = 5, where an average accuracy
(ACC) of 70% is achieved at a false positive rate (FPR) of around
20%. The experimental results indicate that the approach not only
performs satisfactorily but also reduces the computational cost
considerably [17].
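To make the idea of a short sequence model concrete, the sketch below slides a window of length k over a system call trace and measures how many of the resulting short sequences never occur in normal traces. This is a simple lookup-based illustration, not the SVM-based system of [17]; the system call numbers, the traces and the function names are invented for the example.

```python
from typing import Iterable, List, Set, Tuple

Window = Tuple[int, ...]


def short_sequences(trace: List[int], k: int) -> List[Window]:
    """Return every contiguous length-k window of a system call trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def build_normal_profile(traces: Iterable[List[int]], k: int) -> Set[Window]:
    """Collect the set of short sequences observed in normal traces."""
    profile: Set[Window] = set()
    for trace in traces:
        profile.update(short_sequences(trace, k))
    return profile


def mismatch_rate(trace: List[int], profile: Set[Window], k: int) -> float:
    """Fraction of windows in the trace that never occur in the normal profile."""
    windows = short_sequences(trace, k)
    if not windows:
        return 0.0
    unseen = sum(1 for window in windows if window not in profile)
    return unseen / len(windows)


# Hypothetical traces of system call numbers.
normal_traces = [[3, 4, 5, 3, 4, 5, 6], [3, 4, 5, 6, 3, 4, 5]]
profile = build_normal_profile(normal_traces, k=3)

suspicious_trace = [3, 4, 5, 11, 59, 59, 5]
print(mismatch_rate(normal_traces[0], profile, k=3))  # 0.0
print(mismatch_rate(suspicious_trace, profile, k=3))  # 0.8
```

A larger k makes each window more specific but multiplies the number of distinct sequences, which is one reason several values of k are usually compared.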
Zamani and Movahedi present several models for intrusion detection.
The models are divided into those based on classical artificial
intelligence (AI) and those based on computational intelligence
(CI), such as genetic algorithms and fuzzy logic. The authors
conducted various experiments and compared the performance of the
algorithms. The experimental results show that the decision tree
algorithm achieved the best results. The study also explains how
different features of CI models can be used to build an effective
IDS [18].
In order to detect various types of network intrusions efficiently,
Hoque et al. proposed an intrusion detection system (IDS) based on
genetic algorithms (GA). The parameters and evolution processes of
the GA are explained in detail. The proposed model applies
evolution theory to information evolution in order to filter the
traffic data and thus reduce complexity. The KDD99 benchmark
dataset is used to evaluate the performance of the model. The
experimental results show that the model achieves a reasonable
detection rate [19].
Wang and Paschalidis proposed a novel two-stage approach for
detecting the presence of a botnet and identifying the bots. The
first stage detects anomalies by leveraging large deviations of an
empirical distribution, and it suggests two techniques for creating
that distribution: a flow-based approach that estimates the
histogram of quantized flows, and a graph-based approach that
estimates the degree distribution of node interaction graphs. To
detect the bots, the second stage uses social network community
detection in a graph that captures correlations of interactions
among nodes over time. For the experiments, they used real-world
botnet traffic from the CTU-13 dataset [20].
Bhuyan et al. introduced a new approach for creating unbiased,
full-feature, real-life network intrusion datasets in order to
compensate for the crucial lack of available datasets. They created
a sizeable intrusion dataset for use in the development and
validation of detection systems. The study also sets out a set of
requirements for creating an effective dataset. Finally, six
different attack scenarios are created and discussed [21].
Wijesinghe et al. focus on detecting a range of botnet families by
analyzing network traffic flows. Their method consists of two
parts. In the first part, they define appropriate dataset templates
with the more relevant features for detecting botnets from IP
flows. In the second part, they use IP flow data to detect botnet
behaviors in unlabeled traffic. The study uses a publicly available
IPFIX dataset. The approach is a new methodology that contributes
to existing IP flow based botnet detection studies [22].
Haddadi et al. analyzed various botnet detection approaches
according to the model used and the type of data employed. Two of
the approaches, BotHunter and Snort, are public rule-based systems.
The other approaches are based on data mining techniques, either
packet payload based or traffic flow based. The study makes use of
five publicly available botnet datasets, such as CAIDA and ISOT.
The authors conducted several experiments using C4.5, KNN
(k-nearest neighbors), SVM and Bayesian networks. Experimental
results show that the performance of the flow-based system is
higher than or similar to the results reported in the literature
[23].
Bhuyan et al. presented an empirical study that uses different
information metrics to address important security problems such as
the detection of both low-rate and high-rate DDoS attacks. They
conducted several experiments using four information entropy
measures (Hartley entropy, Shannon entropy, Rényi's entropy and
Rényi's generalized entropy) for detecting DDoS attacks of various
types. The CAIDA and TUIDS DDoS datasets are used to show the
efficiency and effectiveness of each metric for DDoS detection
[24].
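The entropy measures named above can be sketched in a few lines of Python using their standard textbook definitions: Hartley entropy is log2 of the number of observed outcomes, Shannon entropy is -Σ p log2 p, and Rényi entropy of order α is log2(Σ p^α) / (1 - α). The source address windows below are invented (documentation-range addresses), and the code is an illustration of the definitions rather than the detection procedure of [24].

```python
import math
from collections import Counter
from typing import List, Sequence


def probabilities(items: Sequence[str]) -> List[float]:
    """Empirical probability of each distinct item (all values are > 0)."""
    counts = Counter(items)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def hartley_entropy(p: Sequence[float]) -> float:
    """log2 of the number of outcomes with non-zero probability."""
    return math.log2(len(p))


def shannon_entropy(p: Sequence[float]) -> float:
    """-sum(p * log2(p)) over the distribution."""
    return -sum(x * math.log2(x) for x in p)


def renyi_entropy(p: Sequence[float], alpha: float) -> float:
    """Renyi entropy of order alpha; alpha == 1 falls back to Shannon."""
    if alpha == 1:
        return shannon_entropy(p)
    return math.log2(sum(x ** alpha for x in p)) / (1 - alpha)


# Hypothetical source IPs seen in two time windows.
balanced_window = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
skewed_window = ["192.0.2.1"] * 5 + ["192.0.2.2", "192.0.2.3", "192.0.2.4"]

for name, window in (("balanced", balanced_window), ("skewed", skewed_window)):
    p = probabilities(window)
    print(name, hartley_entropy(p), shannon_entropy(p), renyi_entropy(p, 2))
```

For the balanced window all three values coincide at 2 bits, while for the skewed window the Shannon and Rényi values drop below the Hartley value, which is the kind of difference between metrics that such a study compares.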
Hoque et al. proposed a novel statistical methodology, called
Feature Feature Score (FFSc), for distinguishing DDoS attack
traffic from normal traffic. The study uses three features of
network traffic: the entropy of source IPs, the variation of source
IPs, and the packet rate. The proposed model is evaluated on the
CAIDA DDoS 2007 and MIT DARPA datasets. The experimental results
show that it yields 98% detection accuracy on the normalized CAIDA
dataset [25].
Kato and Klyuev developed a DDoS attack detection system and
analyzed the characteristics of DDoS attacks. The system uses an
SVM with an RBF (Gaussian) kernel. To compare the performance of
the proposed system, three types of training and test datasets were
created, containing different patterns and different numbers of
patterns. Precision, recall and negative predictive value (NPV)
were calculated to evaluate the system. The developed system
achieved more than 85% accuracy on all types of dataset [26].
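The evaluation measures mentioned above follow directly from the confusion matrix of a binary detector. The sketch below computes precision, recall, NPV and accuracy for a made-up set of labels, where 1 denotes attack and 0 denotes normal; the labels and function names are illustrative assumptions and do not reproduce any result of [26].

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels, 1 = attack, 0 = normal."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, fp, tn, fn


def ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "precision": ratio(tp, tp + fp),  # flagged items that are real attacks
        "recall": ratio(tp, tp + fn),     # real attacks that were flagged
        "npv": ratio(tn, tn + fn),        # items passed as normal that are normal
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical ground truth and detector output for ten flows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Reporting NPV next to precision is useful on imbalanced traffic, where accuracy alone can look high even when many attacks are missed.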
Saad et al. focused on detecting P2P bots, which represent the
newest and most challenging type of botnet currently available. To
detect the command and control (C&C) phase of P2P botnets, they
proposed a characterization of network traffic behaviors. The study
uses five machine learning algorithms: nearest neighbors, linear
support vector machine, artificial neural network, Gaussian-based
classifier and naïve Bayes. The experimental results show that the
true detection rate of P2P botnet C&C is above 90% for the support
vector machine, artificial neural network and nearest neighbors
classifiers, and the total error rate is less than 7% [27].
A number of related detection systems are compared, and the results
are shown in Table 1. In particular, we compare the machine
learning techniques used to develop the detection systems, the
datasets used for experiments, the evaluation methods considered,
the baseline classifiers used for comparison, and so on.
IV. TECHNIQUES
A. Machine Learning
According to Stanford computer science professor Andrew Ng, machine
learning (ML) is “the science of getting computers to act without
being explicitly programmed” [32]. The primary aim of ML is to
build models that take input data and use statistical analysis to
predict an output value within an acceptable range. ML is one of
the fastest expanding areas of computer science, with wide-ranging
applications. ML algorithms are often classified as supervised,
unsupervised or reinforcement learning. Supervised algorithms are
the most commonly used and can be further grouped into regression
and classification. Several machine learning algorithms are used in
the literature [10-40]. Commonly used machine learning algorithms
are:
- Linear Regression
- Logistic Regression
- Decision Tree
- SVM
- Naive Bayes
- KNN
- K-Means
- Random Forest
- Dimensionality Reduction Algorithms
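As an illustration of supervised classification with one of the listed algorithms, the following sketch implements k-nearest neighbors (KNN) with Euclidean distance and majority voting. The two features (packets per second and mean packet size) and all values are hypothetical and chosen only to make the example readable.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_predict(
    train_x: List[Tuple[float, ...]],
    train_y: List[str],
    query: Sequence[float],
    k: int = 3,
) -> str:
    """Label a query point by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical flows described by (packets per second, mean packet size in bytes).
train_x = [(10, 500), (12, 480), (9, 520), (900, 60), (950, 64), (870, 58)]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]

print(knn_predict(train_x, train_y, (880, 62)))  # attack
print(knn_predict(train_x, train_y, (11, 510)))  # normal
```

In practice the features would be scaled first, since a feature with a large numeric range otherwise dominates the distance.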
B. Artificial Intelligence
AI is a field of scientific research that aims to increase
computing power and to develop productive algorithms and
well-organized knowledge. AI is applied to complicated problems
that cannot be solved without intelligence, to discovering hidden
patterns in data, and to developing intelligent machines [18].
AI has numerous applications in knowledge representation,
information retrieval, speech recognition, natural language
understanding, computer vision, bioinformatics, expert systems,
robotics, game playing, and cyber defense, with the help of various
algorithms such as artificial neural networks, genetic algorithms,
artificial immune systems, particle-swarm intelligence, stochastic
algorithms, and fuzzy logic [19, 20].
Artificial Neural Networks (ANNs), an AI technique, are a set of
computer algorithms that are biologically inspired to simulate the
way in which the neurons of the human brain process information
[40]. ANNs acquire their knowledge by detecting the patterns and
relationships in data, and they learn through their architectures,
transfer functions and learning algorithms [40].
Many types of neural networks for various applications are
available in the literature [39]. Multilayer perceptron (MLP)
neural networks are the simplest and most commonly used neural
network architectures [40]. MLPs can be trained with many learning
algorithms; Levenberg-Marquardt (LM) is one of the most widely
preferred training algorithms for MLPs.
Table 1. Outline of the Studies Presented in the Literature
Study | Technique | Dataset | Problem Domain | Evaluation Method | Feature Selection
--- | --- | --- | --- | --- | ---
[25] | Statistical Method | CAIDA DDoS 2007, MIT DARPA datasets | DDoS attack detection | Accuracy | Yes
[23] | C4.5, SVM, KNN, Bayesian Networks | Zeus (Snort), Zeus (NETRESEC), Zeus-2 (NIMS), Conficker (CAIDA) and ISOT-Uvic | Botnet detection | Detection Rate, False Positive Rate | Yes
[12] | SOM | CTU-13 | DDoS attack detection | Accuracy | Yes
[14] | Naïve Bayes, PCA algorithm | KDDCup 1999 | Intrusion Detection | False Positive Rate | Yes
[26] | SVM | CAIDA DDoS 2007 | DDoS attack detection | Precision, Recall, Negative Predictive Value | Yes
[15] | Naive Bayes, AdaBoost, Part and J48 | CSIC 2010 HTTP Dataset | Web Applications Attack | False Positive Rate | No
[16] | Naïve Bayes, Bayes network, decision stump, RBF network | ECML-PKDD 2007 HTTP, CSIC HTTP 2010 | Web Applications Attack | False Positive Rate | No
[17] | k nearest neighbour (kNN) | ADFA Linux data | Host-based Anomaly Detection | Accuracy | No
[19] | Genetic Algorithm | KDDCup 1999 | Intrusion Detection | Detection Rate (DR) | Yes
[24] | Information Metrics | KDD Cup 1999, CAIDA, TUIDS DDoS | DDoS Attack Detection | N/A | No
[27] | NNC, ANN, SVM, NBC, GBC | ISOT | Botnet Detection | True Detection Rate, Error Rate | Yes
[28] | SVM, J48, Naive Bayes, Logistic Regression | ISOT, UNSW-NB-15 | Cloud Security | True Positive, False Negative | No
[29] | Decision Trees, Language Modeling, TF-Based | ECML-PKDD 2007 Dataset | HTTP Attacks | Precision, Recall | Yes
[30] | KNN-SVM | KDD99 | DDoS attack detection | True Positive Rate, False Positive Rate | Yes
[31] | Adaptive Neuro-Fuzzy Inference System | KDD99, CAIDA | DDoS attack detection | Accuracy | No
[32] | Generic-Feature-Selection (GeFS) | CSIC 2010 HTTP Dataset | Feature Selection | Accuracy | Yes
[33] | Random Forest | KDD99 | Feature Selection | Accuracy | Yes
[34] | RBF, SVM | KDD99 | Network Intrusion Detection | True Positive, False Negative | Yes
[35] | Adaptive Time Dependent Transporter Ants Clustering | ISOT | Botnet Detection | Accuracy | No
V. CYBER SECURITY DATASETS
Nowadays, several research groups collect many types of data, both
for their own research and to contribute to community repositories.
This section describes the existing security-related datasets used
in machine learning and artificial intelligence research.
A. KDD Cup 1999 Dataset (DARPA1998)
The DARPA 1998 dataset, the first standard corpus for evaluating
computer network intrusion detection systems, was collected and
distributed by MIT Lincoln Laboratory under the sponsorship of the
Defense Advanced Research Projects Agency (DARPA) and the Air Force
Research Laboratory (AFRL). KDD Cup 1999 is part of the data
collected by MIT Lincoln Laboratory, which includes tcpdump and BSM
list files. The dataset is based on the data captured in the
DARPA'98 IDS evaluation program and was prepared by Stolfo et al.
[5]. It is considered benchmark data for the assessment of
intrusion detection systems. The data includes four main categories
of attack: Denial-of-Service (DoS), User-to-Root (U2R),
Remote-to-Local (R2L) and Probing. The dataset has three symbolic
(categorical) features and thirty-eight numerical features. The
features consist of basic features of individual TCP connections,
content features within a connection suggested by domain knowledge,
and traffic features computed using a two-second time window.
KDD'99 is one of the most widely used datasets for evaluating the
performance of anomaly detection methods. To date, thirty studies
have used the KDD dataset [12-17].
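The time-based traffic features can be illustrated with a short sketch: for each connection, count how many connections in the preceding two seconds went to the same destination host and how many used the same service. The connection records below are invented, and counting the current connection itself is an assumption of this example rather than a statement about how the dataset was built.

```python
from typing import List, NamedTuple, Tuple


class Connection(NamedTuple):
    timestamp: float  # seconds since the start of the capture
    dst_host: str
    service: str


def time_window_counts(
    connections: List[Connection], window: float = 2.0
) -> List[Tuple[int, int]]:
    """For each connection return (same-host count, same-service count)
    over the connections seen in the past `window` seconds.

    Assumes the list is sorted by timestamp; the current connection is
    included in its own counts.
    """
    results = []
    for index, current in enumerate(connections):
        recent = [
            earlier
            for earlier in connections[: index + 1]
            if current.timestamp - earlier.timestamp <= window
        ]
        same_host = sum(1 for c in recent if c.dst_host == current.dst_host)
        same_service = sum(1 for c in recent if c.service == current.service)
        results.append((same_host, same_service))
    return results


# Hypothetical connection log.
log = [
    Connection(0.0, "host-a", "http"),
    Connection(0.5, "host-a", "http"),
    Connection(1.0, "host-b", "http"),
    Connection(2.4, "host-a", "ftp"),
    Connection(5.0, "host-a", "http"),
]
print(time_window_counts(log))
# [(1, 1), (2, 2), (1, 3), (2, 1), (1, 1)]
```

A burst of connections to one host or service inside the window raises these counts sharply, which is why such features help to separate DoS and probing traffic from normal use.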
B. ECML-PKDD 2007 Dataset
The ECML-PKDD 2007 dataset was created for the European Conference
on Machine Learning and Knowledge Discovery in 2007. The ECML/PKDD
Discovery Challenge was a data mining competition held in
conjunction with the 18th European Conference on Machine Learning
(ECML). Table I shows the characteristics of the ECML/PKDD 2007
dataset.
Table I. Features of ECML/PKDD Dataset
Training Set | Test Set
--- | ---
50,116 | 70,143
35,006 (70%) | 42,006 (60%)
15,110 (30%) | 28,137 (40%)
12% | 11%
17% | 18%
15% | 16%
15% | 16%
20% | 18%
23% | 23%
13% | 12%
The dataset is described in Extensible Markup Language (XML). Each
sample is identified by a unique id and consists of three main
parts: context, class and query [18-25].
The context part includes the following features:
- Operating system running on the web server
- HTTP server targeted by the request
- Whether the server understands XPath technology
- Whether there is an LDAP database on the web server
- Whether there is an SQL database on the web server
The query part includes the features method, protocol, uri, query,
headers and body.
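A sample with this three-part layout could be read with Python's standard XML parser as sketched below. The tag names (`sample`, `context`, `class`, `query` and their children) are hypothetical placeholders that follow the part names used above; they are not the dataset's actual schema, which should be taken from the challenge documentation.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical document: tag names mirror the description in the text only.
SAMPLE_XML = """
<dataset>
  <sample id="1">
    <context>
      <os>UNIX</os>
      <webserver>APACHE</webserver>
      <xpath>TRUE</xpath>
      <ldap>FALSE</ldap>
      <sql>TRUE</sql>
    </context>
    <class>Valid</class>
    <query>
      <method>GET</method>
      <protocol>HTTP/1.1</protocol>
      <uri>/index.html</uri>
      <query>id=7</query>
      <headers>Host: example.org</headers>
      <body></body>
    </query>
  </sample>
</dataset>
"""


def parse_samples(xml_text: str) -> List[Dict[str, object]]:
    """Turn each <sample> element into a flat dictionary."""
    root = ET.fromstring(xml_text)
    samples = []
    for sample in root.findall("sample"):
        context = sample.find("context")
        request = sample.find("query")  # outer <query> part of the sample
        samples.append(
            {
                "id": sample.get("id"),
                "label": sample.findtext("class"),
                "context": {child.tag: child.text for child in context},
                "request": {child.tag: child.text or "" for child in request},
            }
        )
    return samples


parsed = parse_samples(SAMPLE_XML)
print(parsed[0]["id"], parsed[0]["label"], parsed[0]["request"]["uri"])
```

Flattening each sample into a dictionary keeps the class label separate from the context and request fields, so the label can later serve as the target of a classifier.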
C. ISOT (Information Security and Object Technology)
Dataset
The ISOT (Information Security and Object Technology) dataset is a
combination of several openly available botnet and normal datasets
and contains 1,675,424 traffic flows in total. The malicious
traffic in ISOT was collected from the French chapter of the
Honeynet Project and consists of the Storm and Waledac botnets. The
non-malicious traffic was obtained from the Traffic Lab at Ericsson
Research in Hungary and was then combined with another dataset
created by the Lawrence Berkeley National Laboratory (LBNL). The
Ericsson Lab data contains general traffic from numerous types of
applications, including HTTP web browsing, World of Warcraft
traffic, and traffic from the Azureus BitTorrent client, which
makes it a considerably large dataset. The LBNL network trace
covers 22 subnets from 2004 to 2005. The LBNL traffic comes from a
medium-sized enterprise network and comprises five large datasets
[28].
D. HTTP CSIC 2010 Dataset
The HTTP CSIC 2010 dataset contains thousands of automatically
generated web requests and was developed at the Information
Security Institute of CSIC (Spanish National Research Council). The
dataset can be used for testing web attack protection systems. It
consists of 36,000 normal requests and more than 25,000 anomalous
requests, and each HTTP request is labeled as normal or anomalous.
For convenience, the dataset is split into three subsets: normal
traffic for training, normal traffic for testing and anomalous
traffic for testing. The anomalous requests cover a wide range of
application layer attacks. The dataset contains three types of
anomalous request: static attacks, dynamic attacks and
unintentional illegal requests. Dynamic attacks include SQL
injection, CRLF injection, cross-site scripting, buffer overflows,
etc. Static attacks try to request hidden resources, such as
obsolete files, session IDs in URL rewrite, configuration files,
default files, etc. Unintentional illegal requests have no
malicious intention, but they do not follow the normal behavior of
the web application and do not have the same structure as normal
parameter values (for example, a telephone number composed of
letters). This dataset has been used successfully for web attack
detection in previous works [40-46].
E. CTU-13 (Czech Technical University) Dataset
The CTU-13 (Czech Technical University) dataset is a combination of
captures of 13 different malware samples in a real network
environment. The aim of the dataset is to capture real botnet
traffic mixed with normal and background traffic. Infected hosts
generated the botnet traffic, and verified normal hosts generated
the normal traffic. Background traffic is the remainder, whose
nature is not known for sure. The CTU-13 dataset includes thirteen
captures of different botnet samples, also known as scenarios. Each
scenario was executed with a particular malware that used various
protocols and carried out several actions. The dataset, created at
CTU University in Prague, Czech Republic, in 2011, is one of the
largest and most thoroughly labeled existing datasets. Grill et al.
were the first to use the CTU-13 dataset. Their study compared
various botnet detection methodologies on CTU-13 and proposed a
novel error metric [14]. To evaluate botnet detection performance,
the BClus, Cooperative Adaptive Mechanism for Network Protection
(CAMNEP) and BotHunter algorithms were used. The dataset has since
been used in many studies. In 2014, Grill et al. used it to measure
the results of the local adaptive multivariate smoothing (LAMS)
model on NetFlow anomaly detection; the proposed model reduced the
false alarm rate of anomaly detection in intrusion detection
systems [16]. The details of the scenarios are shown in Table II.
The advantage of this dataset is that it is carefully labeled and
that the capture process was conducted in a controlled environment
[25-30].
Table II. Amount of data on each botnet scenario
Dataset | Duration (h) | NetFlows | Size (GB) | Bot name | Number of bots | Botnet flow
--- | --- | --- | --- | --- | --- | ---
1 | 6.15 | 2,824,637 | 52 | Neris | 1 | 39933 (1.41%)
2 | 4.21 | 1,808,123 | 60 | Neris | 1 | 18839 (1.04%)
3 | 66.85 | 4,710,639 | 121 | Rbot | 1 | 26759 (0.56%)
4 | 4.21 | 1,121,077 | 53 | Rbot | 1 | 1719 (0.15%)
5 | 11.63 | 129,833 | 37.6 | Virut | 1 | 695 (0.53%)
6 | 2.18 | 558,920 | 30 | Menti | 1 | 4431 (0.79%)
7 | 0.38 | 114,078 | 5.8 | Sogou | 1 | 37 (0.03%)
8 | 19.5 | 2,954,231 | 123 | Murlo | 1 | 5052 (0.17%)
9 | 5.18 | 2,753,885 | 94 | Neris | 10 | 179880 (6.5%)
10 | 4.75 | 1,309,792 | 73 | Rbot | 10 | 106315 (8.11%)
11 | 0.26 | 107,252 | 5.2 | Rbot | 3 | 8161 (7.6%)
12 | 1.21 | 325,472 | 8.3 | NSIS.ay | 3 | 2143 (0.65%)
13 | 16.36 | 1,925,150 | 34 | Virut | 1 | 38791 (2.01%)
F. The ADFA Datasets
In the field of host-based anomaly detection, most of the existing
benchmark datasets, such as the UNM [2] and DARPA [3] intrusion
detection datasets, were compiled a decade ago and fail to reflect
the characteristics of modern computer systems. In 2013, the
Australian Defence Force Academy Linux Dataset (ADFA-LD) was
released by the Australian Defence Force Academy at the University
of New South Wales. To evaluate host-based intrusion detection
systems, the ADFA Linux dataset was generated on an Ubuntu Linux
11.04 host running Apache 2.2.17 with PHP 5.3.5; FTP, SSH, MySQL
14.14 and TikiWiki were also started. The dataset contains normal
and attack traces of Linux system calls. During the sampling stage,
the host, configured to represent a modern Linux server, captures
system call traces while legitimate programs operate as usual.
Subsequently, the cyber-attacks Hydra-FTP, Hydra-SSH, Adduser,
Java-Meterpreter, Meterpreter and Webshell are launched in turn
against the host, each of which results in 8-20 abnormal traces.
Table III shows the composition of ADFA-LD.
Table III. The composition of ADFA-LD
Trace Type | Number | Label
--- | --- | ---
Training | 833 | normal
Validation | 4373 | normal
Hydra-FTP | 162 | attack
Hydra-SSH | 148 | attack
Adduser | 91 | attack
Java-Meterpreter | 125 | attack
Meterpreter | 75 | attack
Webshell | 118 | attack
The aim of the ADFA dataset is to replace the existing benchmark
datasets, which fail to reflect the characteristics of modern
computer systems.
http://www.mdpi.com/1999-5903/8/3/29/htm
G. UNSW-NB15 Dataset
The UNSW-NB15 dataset was created with the IXIA PerfectStorm tool
in the Cyber Range Lab of the Australian Centre for Cyber Security
(ACCS). The dataset combines real modern normal activities with
synthetic contemporary attack behaviours [35-39].
The dataset covers nine major types of attack: Fuzzers, Analysis,
Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and
Worms. To label the dataset, the IXIA PerfectStorm tool produced a
report of the attack data. Table IV lists the number of records of
each category in this dataset.
Table IV. Features of UNSW-NB15 Dataset
Category | Training set | Testing set
--- | --- | ---
Normal | 56,000 | 37,000
Analysis | 2,000 | 677
Backdoor | 1,746 | 583
DoS | 12,264 | 4,089
Exploits | 33,393 | 11,132
Fuzzers | 18,184 | 6,062
Generic | 40,000 | 18,871
Reconnaissance | 10,491 | 3,496
Shellcode | 1,133 | 378
Worms | 130 | 44
Total Records | 175,341 | 82,332
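The record counts in Table IV can be checked and summarized with a few lines of Python. The sketch below verifies that the per-category counts add up to the reported totals and computes each category's share of the training set; the dictionary and function names are chosen for illustration only.

```python
from typing import Dict

# Record counts copied from Table IV.
TRAINING = {
    "Normal": 56000, "Analysis": 2000, "Backdoor": 1746, "DoS": 12264,
    "Exploits": 33393, "Fuzzers": 18184, "Generic": 40000,
    "Reconnaissance": 10491, "Shellcode": 1133, "Worms": 130,
}
TESTING = {
    "Normal": 37000, "Analysis": 677, "Backdoor": 583, "DoS": 4089,
    "Exploits": 11132, "Fuzzers": 6062, "Generic": 18871,
    "Reconnaissance": 3496, "Shellcode": 378, "Worms": 44,
}


def class_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of all records that belongs to each category."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# The per-category counts reproduce the totals given in the table.
print(sum(TRAINING.values()), sum(TESTING.values()))  # 175341 82332

shares = class_shares(TRAINING)
for category, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{category:15s} {share:6.2%}")
```

The shares show a strong class imbalance: Worms and Shellcode together account for well under one percent of the training records, which matters when accuracy is used as the only evaluation measure.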
The dataset has 49 features. The Argus and Bro-IDS tools were used
to extract the features, and 12 models were developed. The features
are grouped into five categories: flow features, basic features,
content features, time features and additional generated features.
Compared with existing datasets, this dataset contains several
attack families that reflect modern low-footprint attacks [40].
VI. CONCLUSION
The protection of computer systems from cyber-attacks is one of the
main issues for national and international security. Numerous
studies have been conducted using several datasets, and artificial
intelligence and machine learning play a significant role in the
protection of computer systems. In this paper, we have outlined a
comprehensive range of datasets along with their advantages and
disadvantages. In the future, we plan to generate a new dataset and
make it publicly available.
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers for their
constructive comments and valuable suggestions.
REFERENCES
[1] Dua, Sumeet, and Xian Du. Data mining and machine learning in
cybersecurity. CRC Press, 2016.
[2] Canongia, Claudia, and Raphael Mandarino Jr. "Cybersecurity: The New
Challenge of the." Handbook of Research on Business Social Networking:
Organizational, Managerial, and Technological Dimensions (2011): 165.
[3] TWOMEY, Paul. Cyber Security Threats. 2010.
[4] Von Solms, Rossouw, and Johan Van Niekerk. "From information
security to cyber security." computers & security 38 (2013): 97-102.
[5] Fraley, James B., and James Cannady. "The promise of machine learning
in cybersecurity." SoutheastCon, 2017. IEEE, 2017.
[6] https://www.symantec.com/content/dam/symantec/docs/other-resources/web-application-firewall-owasp-top-10-2017-coverage-en.pdf
[7] Buczak, Anna L., and Erhan Guven. "A survey of data mining and
machine learning methods for cyber security intrusion detection." IEEE
Communications Surveys & Tutorials 18.2 (2016): 1153-1176.
[8] Thuraisingham, Bhavani, et al. "Data mining for security applications."
Embedded and Ubiquitous Computing, 2008. EUC'08. IEEE/IFIP
International Conference on. Vol. 2. IEEE, 2008.
[9] Meshram, Ankush, and Christian Haas. "Anomaly detection in industrial
networks using machine learning: a roadmap." Machine Learning for
Cyber Physical Systems. Springer Berlin Heidelberg, 2017. 65-72.
[10] Feily, Maryam, Alireza Shahrestani, and Sureswaran Ramadass. "A
survey of botnet and botnet detection." Emerging Security Information,
Systems and Technologies, 2009. SECURWARE'09. Third International
Conference on. IEEE, 2009.
[11] Salem, Malek Ben, Shlomo Hershkop, and Salvatore J. Stolfo. "A survey
of insider attack detection research." Insider Attack and Cyber Security
(2008): 69-90.
[12] Chowdhury, Sudipta, et al. "Botnet detection using graph-based feature
clustering." Journal of Big Data 4.1 (2017): 14.
[13]
[14] Neethu, B. (2013). Adaptive Intrusion Detection Using Machine
Learning. International Journal of Computer Science and Network
Security (IJCSNS), 13(3), 118.
[15] Kozik, R., Choraś, M., Renk, R., & Hołubowicz, W. (2014, November).
A Proposal of Algorithm for Web Applications Cyber Attack Detection.
In IFIP International Conference on Computer Information Systems and
Industrial Management (pp. 680-687). Springer, Berlin, Heidelberg.
[16] NGUYEN, Hai Thanh; FRANKE, Katrin. Adaptive Intrusion Detection
System via online machine learning. In: Hybrid Intelligent Systems (HIS),
2012 12th International Conference on. IEEE, 2012. p. 271-277.
[17] Xie, Miao, Jiankun Hu, and Jill Slay. "Evaluating host-based anomaly
detection systems: Application of the one-class svm algorithm to adfa-ld."
Fuzzy Systems and Knowledge Discovery (FSKD), 2014 11th
International Conference on. IEEE, 2014.
[18] Zamani, Mahdi, and Mahnush Movahedi. "Machine learning techniques
for intrusion detection." arXiv preprint arXiv:1312.2177 (2013).
[19] Hoque, M. S., Mukit, M., Bikas, M., & Naser, A. (2012). An
implementation of intrusion detection system using genetic algorithm.
arXiv preprint arXiv:1204.1336.
[20] Wang, Jing, and Ioannis Ch Paschalidis. "Botnet detection based on
anomaly and community detection." IEEE Transactions on Control of
Network Systems 4.2 (2017): 392-404.
[21] Bhuyan, Monowar H., Dhruba K. Bhattacharyya, and Jugal K. Kalita.
"Towards Generating Real-life Datasets for Network Intrusion
Detection." IJ Network Security 17.6 (2015): 683-701.
[22] Wijesinghe, Udaya, Udaya Tupakula, and Vijay Varadharajan. "An
enhanced model for network flow based botnet detection." Proceedings of
the 38th Australasian Computer Science Conference (ACSC 2015). Vol.
27. 2015.
[23] Haddadi, F., Le Cong, D., Porter, L., & Zincir-Heywood, A. N. (2015,
April). On the Effectiveness of Different Botnet Detection Approaches.
In ISPEC (pp. 121-135).
[24] Bhuyan, Monowar H., D. K. Bhattacharyya, and Jugal K. Kalita. "An
empirical evaluation of information metrics for low-rate and high-rate
DDoS attack detection." Pattern Recognition Letters 51 (2015): 1-7.
[25] Hoque, Nazrul, Dhruba K. Bhattacharyya, and Jugal K. Kalita. "A novel
measure for low-rate and high-rate DDoS attack detection using
multivariate data analysis." Communication Systems and Networks
(COMSNETS), 2016 8th International Conference on. IEEE, 2016.
[26] Kato, Keisuke, and Vitaly Klyuev. "An Intelligent DDoS Attack
Detection System Using Packet Analysis and Support Vector Machine."
IJICR (2014): 478-485.
[27] Sherif Saad, Issa Traore, Ali A. Ghorbani, Bassam Sayed, David Zhao,
Wei Lu, John Felix, Payman Hakimian, "Detecting P2P botnets through
network behavior analysis and machine learning", Proceedings of 9th
Annual Conference on Privacy, Security and Trust (PST2011), July 19-
21, 2011, Montreal, Quebec, Canada.
[28] Bhamare, D., Salman, T., Samaka, M., Erbad, A., & Jain, R. (2016,
December). Feasibility of Supervised Machine Learning for Cloud
Security. In Information Science and Security (ICISS), 2016 International
Conference on (pp. 1-5). IEEE.
[29] GALLAGHER, Brian; ELIASSI-RAD, Tina. Classification of http
attacks: a study on the ECML/PKDD 2007 discovery challenge.
Lawrence Livermore National Laboratory (LLNL), Livermore, CA, 2009.
[30] YUSOF, Ahmad Riza’ain; UDZIR, Nur Izura; SELAMAT, Ali. An
Evaluation on KNN-SVM Algorithm for Detection and Prediction of
DDoS Attack. In: International Conference on Industrial, Engineering and
Other Applications of Applied Intelligent Systems. Springer International
Publishing, 2016. p. 95-102.
[31] Kumar, P. Arun Raj, and S. Selvakumar. "Detection of distributed denial
of service attacks using an ensemble of adaptive and hybrid neuro-fuzzy
systems." Computer Communications 36.3 (2013): 303-319.
[32] C. Torrano-Gimenez, H. Nguyen, G. Álvarez, S. Petrovic, K. Franke,
Applying Feature Selection to Payload-Based Web Application Firewalls.
In Proc. of International Workshop on Security and Communication
Networks (IWSCN 11), pp. 75-81. Editor Patric Bours. Gjøvik (Norway).
ISBN: 978-82-91313-67-2. May 18-20 (2011).
[33] Hasan, M., Nasser, M., Ahmad, S. and Molla, K. (2016). Feature
Selection for Intrusion Detection Using Random Forest. Journal of
Information Security, 07(03); 129-140.
[34] Panda, Mrutyunjaya, Ajith Abraham, and Manas Ranjan Patra. "A hybrid
intelligent approach for network intrusion detection." Procedia
Engineering 30 (2012): 1-9.
[35] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. Kendall, D.
McClung, D. Weber, S. Webster, D. Wyschogrod, R. Cunningham, and
M. Zissman. Evaluating Intrusion Detection Systems: The 1998 DARPA
off-line intrusion detection evaluation. In Proc. of DARPA Information
Survivability Conference and Exposition (DISCEX00), Hilton Head,
South Carolina, January 25-27. IEEE Computer Society Press, Los
Alamitos, CA, pp. 12-26 (2000).
[36] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba and K. Das. The 1999
DARPA Off-Line Intrusion Detection Evaluation. In Proc. Recent
Advances in Intrusion Detection (RAID2000). H. Debar, L. Me, and S. F.
Wu, Eds. Springer-Verlag, New York, NY, pp. 162-182 (2000).
[37] J. McHugh. Testing Intrusion Detection Systems: A Critique of the 1998
and 1999 DARPA Intrusion Detection System Evaluations as Performed
by Lincoln Laboratory. ACM Transactions on Information and
System Security (TISSEC) 3(4), pp. 262-294 (2000).
[38] A. Perez-Villegas, C. Torrano-Gimenez, G. Alvarez. Applying Markov
Chains to Web Intrusion Detection. In Proc. of Reunión Española sobre
Criptología y Seguridad de la Información (RECSI 2010), pp. 361-366.
Publicaciones urv. Tarragona (Spain), September 7-10 (2010).
[39] C. Torrano-Gimenez, A. Perez-Villegas, G. Alvarez. An anomaly-based
approach for intrusion detection in web traffic. Journal of Information
Assurance and Security, vol. 5, issue 4, pp. 446-454. ISSN 1554-1010
(2010).
[40] C. Torrano-Gimenez, A. Perez-Villegas, G. Alvarez, A Self-Learning
Anomaly-Based Web Application Firewall. In Proc. of 2nd International
Workshop in Computational Intelligence in Security for Information
Systems (CISIS 09). Advances in Intelligent and Soft Computing, vol. 63,
pp. 85-92, Springer-Verlag. A. Herrero, P. Gastaldo, R. Zunino, E.
Corchado, editors. Burgos (Spain), September 23-26 (2009).
[41] C. Torrano-Gimenez, A. Perez-Villegas, G. Alvarez, An Anomaly-based
Web Application Firewall. In Proc. of International Conference on
Security and Cryptography (SECRYPT 2009), pp. 23-28. INSTICC Press.
E. Fernández-Medina, M. Malek, J. Hernando, editors. Milan (Italy),
July 7-10 (2009).
[42] H. Nguyen, C. Torrano-Gimenez, G. Álvarez, S. Petrovic, K. Franke,
Application of the Generic Feature Selection Measure in Detection of
Web Attacks. In Proc. of International Workshop in Computational
Intelligence in Security for Information Systems (CISIS 11 ), LNCS 6694,
pp. 25–32. Editors Á. Herrero and E. Corchado, Springer-Verlag.
Torremolinos, Málaga (Spain), June (2011).
[43] C. Torrano-Gimenez, H. Nguyen, G. Álvarez, S. Petrovic, K. Franke,
Applying Feature Selection to Payload-Based Web Application Firewalls.
In Proc. of International Workshop on Security and Communication
Networks (IWSCN 11), pp. 75-81. Editor Patric Bours. Gjøvik (Norway).
ISBN: 978-82-91313-67-2. May 18-20 (2011).
[44] Chinotec Technologies Company: Paros - for web application security
assessment. http://www.parosproxy.org/index.shtml (2004).
[45] Andrés Riancho: Web Application Attack and Audit Framework.
http://w3af.sourceforge.net (2007).