Developing Realistic Distributed Denial of Service (DDoS) Dataset for Machine Learning-based Intrusion Detection System

Hassan Jalil Hadi 1,2, Umer Hayat 1, Numan Musthaq 1, Faisal Bashir Hussain 1, Yue Cao 2
1 Dept. Cyber Reconnaissance and Combat (CRC) Lab, Bahria University, Islamabad, Pakistan
2 School of Cyber Science and Engineering, Wuhan University, Wuhan, China
hassanjalilhadi1142@gmail.com, fbashir.buic@bahria.edu.pk, yue.cao@whu.edu.cn
Abstract—During the last decade, attackers have compromised reputable systems to launch massive Distributed Denial of Service (DDoS) attacks against banking services, corporate websites, and e-commerce businesses. Such attacks cause enormous reputational and financial losses and disrupt services to authorized users. Diverse solutions have been proposed to combat emerging DDoS attacks; however, no ideal solution is available to date. To validate the majority of the existing solutions, researchers have relied on simulation-based experiments, which have become obsolete. Nowadays, the trend has shifted to publicly available realistic datasets for DDoS validation. In this research study, we provide a comprehensive review of currently available datasets and propose a novel taxonomy for the classification of DDoS attacks. Further, we generate a new dataset called “CRCDDoS2022”, which can overcome the shortcomings of existing datasets. Moreover, with this new dataset, a new attack (malware) family classification and detection approach is provided, based on a set of network flow features. Lastly, this research provides the most significant feature sets for the detection of various types of DDoS attacks, along with their corresponding weights.
Index Terms: DDoS, MLIDS, AI, Network Traffic, Machine Learning, Cyberattack
I. INTRODUCTION
Currently, with the advent of devices interconnected through the Internet, the looming threat of cyber-attacks cannot be denied, and DDoS has caused havoc over the years [1]. In a DDoS attack, the attacker produces volumetric traffic that exhausts the system resources in the victim's network. The attack is normally started by one attacker who exploits and takes control of multiple devices called zombies. The owners of these zombie devices are unaware that their systems are being used for illegitimate purposes. Normally, the attacker performs a sweep operation to identify devices that can become zombies. The sweep primarily hunts for devices with an open port, as these are the best candidates to become zombies. Later on, the attacker uses the zombie devices to launch the attack. Attack detection turns out to be challenging as the number of zombie devices can grow [2].
Different techniques have been presented for the prevention of DDoS attacks; however, they remain a significant threat to network security. The existing solutions comprise anomaly-based and signature-based intrusion detection techniques [3]. Signature-based Intrusion Prevention Systems (IPS) and Intrusion Detection Systems (IDS) play a significant role in defending against cyber-attacks but, unfortunately, most of these systems are not very effective against DDoS and zero-day attacks. Recent research indicates that anomaly-based intrusion detection techniques are more effective at detecting intrusions, and they have gained a lot of attention from researchers in the past few years. In contrast, signature-based intrusion detection methods are comparatively easier to implement but are limited to known signatures.
Anomaly-based approaches such as Deep Learning and Machine Learning are a subset of AI and can be used to distinguish abnormal from benign traffic. Telecom manufacturers are currently paying more attention to anomaly-based intrusion detection solutions because of their ability to detect cyber-attacks effectively and the availability of advanced computing power. Palo Alto Networks launched the first ever machine learning-based IDS (MLIDS) in 2020. However, the performance of an MLIDS depends on valuable datasets for training; with such datasets, numerous network attacks can be detected with high accuracy.
II. AVAILABLE DATASETS
In this research, publicly available DDoS attack datasets from 2000 to 2019 have been evaluated, and the need for a reliable and comprehensive dataset for the validation and testing of DDoS attack detection systems has been closely examined.
Numerous intrusion detection datasets are available for anomaly detection. Such datasets are created with synthetic methods or simulations that cannot reflect real attack scenarios and are not generated through complete protocol stacks. The first known dataset, the Defense Advanced Research Projects Agency (DARPA) dataset, was created by Lincoln Laboratory in 1998 [11]. Later, researchers created many other datasets, for example Knowledge Discovery and Data Mining (KDD99) (1998-99), DEFCON (2000-2002), Network Security Laboratory KDD (2009), CAIDA (2000-2016) and Kyoto (2009). Each dataset has its own limitations. The main limitations include low traffic diversity, limited data size, a lack of real-time conditions, and non-realistic attack scenarios.
2022 9th International Conference on Internet of Things: Systems, Management and Security (IOTSMS) | 979-8-3503-2045-9/22/$31.00 ©2022 IEEE | DOI: 10.1109/IOTSMS58070.2022.10062034
Authorized licensed use limited to: Wuhan University. Downloaded on April 19,2023 at 10:36:58 UTC from IEEE Xplore. Restrictions apply.
Sharafaldin et al. pointed out the weaknesses of these datasets in detail [4] and developed a new dataset called CICIDS2017. The dataset was developed for intrusion detection and is the most comprehensive among the datasets created so far. Building on this initial work [4], they also created the DDoS2019 dataset [5]. For this dataset, a complete network topology was used to generate different attack traffic along with background traffic over multiple days, and the captured traffic was analyzed with the flow-oriented CICFlowMeter. CICFlowMeter is a network traffic analyzer that can generate 80 statistical attributes, such as number of packets, packet length, duration, and total bytes, in both the forward and backward directions.
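To make the idea of flow-level statistics concrete, the following minimal sketch groups packets into bidirectional flows and computes a few of the attribute types named above (packet count, total bytes and duration, per direction). It is an illustration in plain Python and not CICFlowMeter's implementation: the `Packet` record and the `flow_features` helper are hypothetical names, the packet values are made up, and the convention that the first packet seen defines the forward direction is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical minimal packet record (not a CICFlowMeter type)."""
    ts: float     # capture timestamp in seconds
    src: str      # source IP address
    dst: str      # destination IP address
    sport: int    # source port
    dport: int    # destination port
    proto: str    # transport protocol name
    length: int   # packet length in bytes


def flow_features(packets):
    """Group packets into bidirectional flows and compute basic statistics.

    Assumption: the first packet seen for a flow defines its forward direction.
    """
    flows = {}
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        fwd_key = (p.src, p.sport, p.dst, p.dport, p.proto)
        bwd_key = (p.dst, p.dport, p.src, p.sport, p.proto)
        if fwd_key in flows:
            key, direction = fwd_key, "fwd"
        elif bwd_key in flows:
            key, direction = bwd_key, "bwd"
        else:
            # New flow: this packet's direction becomes "forward".
            key, direction = fwd_key, "fwd"
            flows[key] = {"start": p.ts, "end": p.ts,
                          "fwd_pkts": 0, "bwd_pkts": 0,
                          "fwd_bytes": 0, "bwd_bytes": 0}
        flow = flows[key]
        flow["end"] = p.ts
        flow[direction + "_pkts"] += 1
        flow[direction + "_bytes"] += p.length
    for flow in flows.values():
        flow["duration"] = flow["end"] - flow["start"]
    return flows


# Made-up example: one TCP conversation with two forward and one backward packet.
packets = [
    Packet(0.00, "10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 60),
    Packet(0.05, "10.0.0.2", "10.0.0.1", 80, 40000, "TCP", 60),
    Packet(0.10, "10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 500),
]
features = flow_features(packets)
for key, stats in features.items():
    print(key, stats)
```

A real flow exporter additionally handles flow timeouts, TCP flags and inter-arrival statistics, which this sketch leaves out.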
The authors in [4] simulated background traffic in a network of 25 users for 5 protocols: FTP, SSH, HTTP, HTTPS, and email. This paper explores their DDoS attack cases in particular, with the aim of developing a new real-time DDoS dataset. The authors utilized LOIC to simulate DDoS attacks. The main deficiency of their dataset is that the raw network traffic is not real. In particular, an actual DDoS attack takes place when thousands of different attacks flood and overwhelm bandwidth as well as services. Developing a real-time dataset means creating a real environment that imitates a wide range of client behavior, both malicious and benign, including thousands of known attacks, with the help of real stateful traffic and applications. Real-time scenarios with time-varying and dynamic conditions are also included. Most existing DDoS datasets have limitations, for example the dataset developed by Sharafaldin et al. [5]. In this project, a realistic DDoS dataset was developed with modern simulators (HOIC, Slowloris, DDoSIM, HULK, Goldeneye, Bonesi, Mirai Botnet, Tor Hammers), which reduces the need for a full network topology. Subsequently, we evaluated our new dataset. The evaluation identified the features that are crucial for detection and assessed the performance of available machine learning algorithms. In addition, Principal Component Analysis (PCA) is performed to reduce dimensionality. Using CRCFM2022 is a novel technique for developing realistic datasets. As per our in-depth analysis, this technique has not been used in any existing dataset until now.
This paper is organized as follows. Section III explains the testbed, Section IV gives the test scenarios, and Section V provides the results and analysis. Lastly, Section VI concludes the research paper.
III. TESTBED
Modern simulators (HOIC, Slowloris, DDoSIM, Mass, Goldeneye, Bonesi, Mirai Botnet, Tor Hammer) are used to create real-world attack simulations. These simulators are powerful and simple to use, and they produce realistic application traffic as well as attacks for testing the performance, versatility and security of application-oriented networks. They are recognized by industry as some of the most powerful test systems for OSI layers 3-7, and they imitate real application traffic. They have separate libraries of thousands of practical applications and attack vectors, and they are updated daily to support load and functional testing with unparalleled versatility.
In DDoS attacks, attackers try to overwhelm servers, which consumes a large amount of network bandwidth through ingress and egress traffic. By utilizing these test systems, the network architecture was simplified while a full network topology was still created, as recommended for the development of an effective dataset. Notably, these simulators can imitate a hundred thousand attacks every second with spoofed MAC and IP addresses and with different protocols and traffic types. The testbed permits load balancing between attack and benign traffic. With these options, the test system produces real-time traffic through full protocol stacks. Fig. 1 presents the newly developed testbed for DDoS attack scenarios.
Two network groups are used in this test system: the attacker group (172.16.221.108) and the victim group. Simplified configurations were used in the victim network, such as a Linux server (172.16.221.133). Likewise, Windows servers (172.16.221.133) with PCs (172.16.x.x) were used. SMTP, DNS, and web services are provided by the Linux server, whereas the Windows server provides only web services. The attacker group is connected to the victim network over a 10 Gbps fiber link, so that attack scenarios can be developed with a maximum load of approximately 10 Gbps. Outgoing and incoming traffic is captured through a mirror port with tcpdump. Table I presents the list of workstations, servers and firewalls with their operating systems and associated private and public IPs on the testing and training days.
TABLE I
IPS AND TARGET NETWORK OPERATING SYSTEM
Machine            | OS                            | IP
Server             | Ubuntu 22.04 LTS (Web Server) | 172.16.221.133
Firewall           | pfSense                       | 10.10.23.11
PCs (Training Day) | Win 7 Home Basic              | 172.16.221.134
                   | Win 8.1 Pro                   | 172.16.221.135
                   | Win 10 Education              | 172.16.221.136
PCs (Testing Day)  | Win 7 Home Basic              | 172.16.221.137
                   | Win 8.1 Pro                   | 172.16.221.138
                   | Win 10 Education              | 172.16.221.139
A third party executed the attack families on the testing and training days. The victim network comprises one server (web server), two switches, one firewall and four PCs. One port on the main switch of the victim network is configured as the mirror port and is dedicated to capturing all incoming and outgoing traffic of the network.
Fig. 1. Testbed
IV. TEST SCENARIOS
Three types of DDoS attacks are considered for the tests: protocol DDoS (PDDoS), volumetric DDoS (VDDoS), and application layer attacks. This research focuses on the PDDoS and VDDoS attack scenarios. PDDoS is a large-scale, protocol-based DDoS attack designed to cause state exhaustion. VDDoS is also a large-scale attack that transmits large volumes of data and is intended to cause network congestion. VDDoS comprises layer 3 and layer 4 flood attacks. These include typical ICMP flooding (type 0, 4-18) and twisted flooding utilizing multiple protocols such as UDP, ICMP, raw IP, and TCP. PDDoS comprises layer 4 connection-oriented TCP SYN, ACK and RST flooding attacks. All of these attacks were generated to build the DDoS dataset, with special attention paid to fundamental flood attacks. Benign traffic was imitated using mixed traffic from the profiles of 80 clients together (incorporating HTTP, HTTPS, FTP, SSH, SMTP, RTSP, POP3, BitTorrent, Telnet, SIP, and IMAP).
A. Test Procedures
The tests were executed on the testbed described in Section III. Each traffic scenario is generated from the simulators towards the victim servers, and tcpdump collects the data. Raw data is captured in PCAP format. The simulators are used to create multiple attackers with spoofed IP addresses and separate MAC addresses. In this work, 5 days of simulations were used for the DDoS attacks utilizing the LOIC simulation tool, producing roughly 10-70 GB of raw data. To launch realistic attacks, an exceptionally high load was used together with rapid attacks. The high load exhausted the bandwidth in both the inbound and outbound directions, and all communication between the attacker and victim networks was captured through the mirror port. Almost thirty minutes of test time were used for size equivalency of the raw data; further details are recorded in Table II. Figure 2 shows a screen capture of the benign mixed traffic from the network, in which we imitated 80 clients with different traffic profiles such as HTTP, SMTP, SSH, and so on (see footnotes 1 and 2).
Fig. 2. Benign Traffic
V. ANALYSIS
A testbed was created with multiple simulators (including HOIC, Slowloris, DDoSIM, HULK, Goldeneye, Bonesi, Mirai Botnet, Tor Hammers). The dataset was assessed against the recommendations provided in [10] to make it a valuable dataset. To improve the quality of the DDoS attack dataset, traffic was generated from the attackers to the victim network using a simplified yet full network topology, as displayed in Fig. 1. Moreover, real traffic was generated with the simulators for a target network with different components, comprising servers, PCs, firewalls, switches and routers. Incoming and outgoing traffic was then captured for the duration of each test from the mirror port using tcpdump. The captured traffic shows the connections between network components inside the target network. Utilizing the simulators' capabilities, real traffic incorporating different protocols, including SSH, was produced. Similarly, different attack vectors as well as normal traffic were produced, as shown in Fig. 3, which shows the variety of benign traffic and attacks.
TABLE II
DATASET DESCRIPTION
Attack Type    | Attack Time | File Size | Target Device         | Avg. Load | No. of Attack Addresses
ICMP flood     | 24 hours    | 73.65 GB  | Ubuntu/Windows Server | 1 Gbps    | 1500 MAC addresses with 1500 spoofed IP addresses
UDP Port 53    | 2 hours     | 5.20 GB   | Ubuntu Server         | 1 Gbps    | 1100 MAC addresses with 1100 spoofed IP addresses
Syn Port 25    | 2 hours     | 1.1 GB    | Ubuntu Server         | 500 Mbps  | 1100 MAC addresses with 1100 spoofed IP addresses
Syn Port 80    | 2 hours     | 4.3 GB    | Ubuntu Server         | 500 Mbps  | 1100 MAC addresses with 1100 spoofed IP addresses
Syn Port 443   | 2 hours     | 4.1 GB    | Ubuntu Server         | 500 Mbps  | 1100 MAC addresses with 1100 spoofed IP addresses
Bots           | 5 hours     | 23 GB     | Ubuntu/Windows Server | 800 Mbps  | 500 MAC addresses with 500 spoofed IP addresses
Port Map       | 3 hours     | 1 GB      | Ubuntu/Windows Server | 200 Mbps  | 500 MAC addresses with 500 spoofed IP addresses
TFTP           | 2 hours     | 3 GB      | Ubuntu/Windows Server | 400 Mbps  | 500 MAC addresses with 500 spoofed IP addresses
Benign Traffic | 20 hours    | 21.5 GB   | Ubuntu/Windows Server | 1 Gbps    | 80 MAC addresses with 80 profile addresses
Fig. 3. PCA Analysis
1 Demo Video: https://bit.ly/3rGfT1R
2 Dataset Link: https://github.com/CRC-Center/CRCDDoS2022
After that, the features were extracted from the raw data using CICFlowMeter. Timestamps and other features that describe the traffic flows are available in the labeled dataset. Hence, a valuable DDoS dataset is created with over 5.6 million flows. With CICFlowMeter, 86 features were extracted from the raw PCAP data. For details of the extracted features, we refer the reader to [6]. We used 71 of the 86 features for the initial analysis. The full labeled dataset contains more than 5.6 million records. For any anomaly detection approach, a dataset of this size is adequate for training and testing a model. The dataset comprises 30.57% normal and 70.24% DDoS-labeled flows. Figures 4 to 9 present the distribution of the different DDoS attacks and normal traffic.
For the analysis, the Scikit-learn library was used with the Python programming language. As a large number of features were involved, dimensionality reduction was performed with Principal Component Analysis (PCA) [8]. PCA is a very common technique for reducing a large set of features to a smaller set that retains most of the information in the large set, which makes the data easier to visualize and interpret. It does so by constructing new principal components that hold most of the information. Figure 3 presents the PCA analysis. We observed that 27 principal components hold 96% of the information. Figure 10 presents the ranking of the 27 important features for every PCA analysis.
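The dimensionality reduction step can be sketched with Scikit-learn as follows. This is a minimal illustration under stated assumptions and not the exact pipeline used for CRCDDoS2022: the data is a synthetic stand-in for the flow-feature matrix, standardizing the features before PCA is our assumption, and "information" is interpreted as explained variance. Passing a fraction as `n_components` asks `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a flow-feature matrix (assumption, not the real dataset):
# 500 flows x 20 features driven by 5 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

# Standardize so that features with large numeric ranges do not dominate.
X_scaled = StandardScaler().fit_transform(X)

# Keep as many principal components as needed to retain 96% of the variance.
pca = PCA(n_components=0.96, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

retained = pca.explained_variance_ratio_.sum()
print(f"{X_reduced.shape[1]} components retain {retained:.2%} of the variance")
```

On the real feature matrix, the same call would report how many components are needed for the chosen threshold; the paper reports 27 components for 96%.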
We used our dataset to measure the performance of a few common ML algorithms: Quadratic Discriminant Analysis (QDA), Random Forest (RF), Gaussian Naïve Bayes (GNB), Iterative Dichotomiser 3 (ID3) and Multi-layer Perceptron (MLP, a class of feed-forward ANN) [9], [10].
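A comparison of this kind can be sketched with Scikit-learn as below. The sketch rests on several assumptions: the data is synthetic (generated with `make_classification`), the 50/50 split mirrors the 50 percent training sets mentioned for Table III, the hyperparameters are arbitrary, and Scikit-learn does not ship ID3, so an entropy-based `DecisionTreeClassifier` is used as a stand-in. Timings and scores from this sketch say nothing about the results on CRCDDoS2022.

```python
import time

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with 27 features (stand-in for benign vs. DDoS flows).
X, y = make_classification(n_samples=1000, n_features=27, n_informative=10,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

models = {
    "QDA": QuadraticDiscriminantAnalysis(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "GNB": GaussianNB(),
    # Stand-in for ID3: an entropy-based decision tree (CART), not true ID3.
    "ID3 (approx.)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=200, random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    elapsed = time.perf_counter() - start  # training plus prediction time
    results[name] = {
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "seconds": elapsed,
    }

for name, r in results.items():
    print(f"{name:14s} Pr={r['precision']:.3f} Rc={r['recall']:.3f} "
          f"F1={r['f1']:.3f} time={r['seconds']:.3f}s")
```

Recording the execution time next to the scores makes it possible to compare models that reach similar accuracy but differ in cost.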
At first, the performance of the above ML techniques was analyzed using the full dataset with 86 features, and afterwards using a dataset with 27 features. Similar performance was obtained with the 27 features selected on the basis of the PCA analysis. We likewise observe that, with a larger number of test samples, all of these algorithms give equivalent results, with precision of up to 98.5 - 100 percent. A common question is how much data should be used for testing. The answer depends only on the dataset and the complexity of the algorithm. Most commonly, a dataset is split in a ratio of 80% to 20% for training and testing, respectively. This is a consequence of the restricted size of the dataset. The most appropriate training size can be found with heuristic techniques.
Fig. 4. Normal Traffic vs Day1 DDoS Attacks
Fig. 5. Normal Traffic vs Day2 DDoS Attacks
Fig. 6. Normal Traffic vs Day3 DDoS Attacks
Fig. 7. Normal Traffic vs Day4 DDoS Attacks
Fig. 8. Normal Traffic vs Day5 DDoS Attacks
Fig. 9. Normal vs DDoS Attacks
Observable changes were noted when the size of the training dataset was decreased. To quantify the performance of the ML algorithms, standard evaluation metrics were used: Precision (Pr), Recall (Rc), and the F1 measure (F1 score). These were computed from the Confusion Matrix (CM) on our dataset for each ML algorithm. The CM summarizes the performance of a model on a dataset with known labels. Table III gives the results with 50 percent training sets for each model; the training precision is extremely high because of the exceptionally large training dataset. Accuracy is the proportion of predictions that a model makes correctly, so more correct predictions mean higher accuracy. Accuracy is calculated as follows:
\[ \text{Accuracy} = \frac{\text{No. of correct predictions}}{\text{Total number of predictions}} \tag{1} \]
The next parameter, Precision, is the ratio of true positives (TP) to the sum of true positives and false positives (FP). A better model shows higher precision. Precision is calculated as below:
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]
Another parameter is Recall. It is the sensitivity of a system.
It is calculated as below:
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]
Fig. 10. Feature Ranking
The final parameter is the F-Score, which is calculated from Recall and Precision. It is the harmonic mean of Precision and Recall. The F-Score lies between 0 and 1. A value of 1 indicates that Recall and Precision are both perfect, while 0 indicates that either Recall or Precision is 0. The F1-Score is calculated as:
\[ \text{F1 Score} = \left( \frac{\text{Recall}^{-1} + \text{Precision}^{-1}}{2} \right)^{-1} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \]
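As a worked example of the metrics defined in this section, the following short sketch computes Accuracy, Precision, Recall and the F1 score from the four cells of a binary confusion matrix. The helper name `metrics_from_confusion` and the counts are hypothetical and chosen only for illustration; they are not results from the dataset.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 when both are 0.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 90 attacks detected, 10 false alarms, 30 attacks missed,
# and 870 benign flows correctly passed.
m = metrics_from_confusion(tp=90, fp=10, fn=30, tn=870)
print(m)
```

With these counts, accuracy is 960/1000 = 0.96, precision is 90/100 = 0.90, recall is 90/120 = 0.75, and F1 is 2 × 0.75 × 0.90 / 1.65 ≈ 0.818, which shows how a model can look strong on accuracy while still missing a quarter of the attacks.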
Although the performance of all techniques is very close, the execution time differs. The detection accuracy of GNB and ID3 is almost 99.5% to 100 percent; however, the best performance per execution time is given by GNB, as presented in Table III.
GNB takes just 12.70 sec, while MLP takes the longest, 2906.34 sec. Based on this analysis, either the GNB or the ID3 model can be used for continuous DDoS detection, since they need the least time to reach very high accuracy. For instance, the training accuracy of GNB was recorded as almost 0.99585, which is reflected in precise detection numbers. This can be seen from the Confusion Matrix for GNB. GNB correctly classified all DDoS attacks and produced only around 0.1% false positives in DDoS detection, in terms of weighted average performance with fifty percent training and fifty percent testing on the 5.6 million records. The description of the evaluation metrics is given below.
VI. CONCLUSION
A unique real dataset with more than 5.6 million data samples for DDoS was generated with the help of multiple simulators in a network. The sample size is sufficient for any anomaly detection model to be trained efficiently to detect the aforementioned attacks. The attack diversity in the dataset was created by on the order of a thousand spoofed attackers and with 10 Gbps traffic. The research community will be able to access the labeled dataset (CRCDDoS2022) and the raw traffic created in this project once it is finished. As the created dataset has a large dimension, PCA was performed and the dimensionality was reduced from 86 to 27 features, while retaining 96% of the information in all the features used. The analysis of commonly used ML algorithms indicated that, even when a quality dataset is used for training and testing, the precision achieved may vary. The performance of all ML algorithms used in this project can be compared using the Rc, Pr, and F1 parameters. However, the time required for execution can differ. The lowest execution time was given by GNB and the highest by MLP. The use of multiple simulators to develop this real dataset is a novel approach. These simulators are recognized by industry for validation and testing and have so far not been used to create any available dataset. The future goal is to extend this work to create a realistic IDS system (see footnotes 3 and 4).
VII. ACKNOWLEDGEMENT
This work is supported in part by the Hubei Province Key
Research and Development Program (2021BAA027).
REFERENCES
[1] J. Li, S. Kamin, G. Zheng, F. Neubrech, S. Zhang, and N. Liu, "Addressable metasurfaces for dynamic holography and optical information encryption," Science Advances, vol. 4, no. 6, eaar6768, 2018.
[2] H. J. Hadi, S. M. Sajjad, and K. un Nisa, "BoDMitM: Botnet detection and mitigation system for home router base on MUD," in 2019 International Conference on Frontiers of Information Technology (FIT), 2019, pp. 139-1394: IEEE.
[3] K. M. Prasad, A. R. M. Reddy, and K. V. Rao, "DoS and DDoS attacks: defense, detection and traceback mechanisms-a survey," Global Journal of Computer Science and Technology, 2014.
[4] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, "Toward generating a new intrusion detection dataset and intrusion traffic characterization," in ICISSP, 2018, vol. 1, pp. 108-116.
[5] I. Sharafaldin, A. H. Lashkari, S. Hakak, and A. A. Ghorbani, "Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy," in 2019 International Carnahan Conference on Security Technology (ICCST), 2019, pp. 1-8: IEEE.
[6] A. H. Lashkari, G. Draper-Gil, M. S. I. Mamun, and A. A. Ghorbani, "Characterization of tor traffic using time based features," in ICISSP, 2017, pp. 253-262.
[7] O. Kramer, "Scikit-learn," in Machine Learning for Evolution Strategies: Springer, 2016, pp. 45-53.
[8] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, G. Casella, S. Fienberg, and I. Olkin, Eds., 2017.
[9] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[10] L. Breiman, "Random forests," Machine Learning, vol. 45, Oct. 2001.
[11] "2000 DARPA Intrusion Detection Scenario Specific Datasets | MIT Lincoln Laboratory," Ll.mit.edu, 2022. [Online]. Available: https://www.ll.mit.edu/rd/datasets/2000-darpa-intrusiondetection-scenario-specific-datasets. [Accessed: 11-Apr-2022]
3 Demo Video: https://bit.ly/3rGfT1R
4 Dataset Link: https://github.com/CRC-Center/CRCDDoS2022