
Developing Realistic Distributed Denial of Service (DDoS) Dataset for Machine Learning-based Intrusion Detection System

May 2023


DOI:10.1109/IOTSMS58070.2022.10062034

Conference: IoTSMS 2022
At: Italy

Authors:

Umar Hayat

Bahria University Islamabad Campus



Hassan Jalil Hadi 1,2, Umer Hayat 1, Numan Musthaq 1, Faisal Bashir Hussain 1, Yue Cao 2,*

1 Dept. Cyber Reconnaissance and Combat (CRC) Lab, Bahria University, Islamabad, Pakistan

2 School of Cyber Science and Engineering, Wuhan University, Wuhan, China

hassanjalilhadi1142@gmail.com, fbashir.buic@bahria.edu.pk, yue.cao@whu.edu.cn

Abstract: During the last decade, attackers have compromised reputable systems to launch massive Distributed Denial of Service (DDoS) attacks against banking services, corporate websites, and e-commerce businesses. Such attacks cause enormous reputational and financial losses and disrupt services to authorized users. Diverse solutions have been proposed to combat emerging DDoS attacks; however, no ideal solution is available to date. To validate most of the existing solutions, researchers have relied on simulation-based experiments that have become obsolete. Nowadays, the trend has shifted to publicly available realistic datasets for DDoS validation. In this study, we provide a comprehensive review of currently available datasets and propose a novel taxonomy for the classification of DDoS attacks. Further, we generate a new dataset called "CRCDDoS2022", which is designed to overcome the existing shortcomings. With this new dataset, we also provide a new attack (malware) family classification and detection approach based on a set of network flow features. Lastly, we provide the most significant feature sets, along with their corresponding weights, for detecting DDoS attacks of various types.

Index Terms: DDoS, MLIDS, AI, Network Traffic, Machine Learning, Cyberattack

I. INTRODUCTION

With the growing number of devices interconnected through the internet, the looming threat of cyber-attacks cannot be denied, and DDoS attacks have caused havoc over the years [1]. In a DDoS attack, the attacker produces volumetric traffic that exhausts the system resources in the victim's network. The attack is normally started by one attacker who exploits and takes control of multiple devices, called zombies. The owners of these zombie devices are unaware that their systems are being used for illegitimate purposes. Normally, the attacker performs a sweep operation to identify devices that can become zombies. The sweep primarily hunts for devices with an open port, as these are the best candidates to become zombies. Later on, the attacker uses the zombie devices to launch the attack. Detecting such attacks turns out to be challenging because the number of zombie devices can grow very large [2].

Different techniques have been presented for the prevention of DDoS attacks; however, these attacks remain a significant threat to network security. The existing solutions comprise anomaly-based and signature-based techniques for intrusion detection [3]. Signature-based Intrusion Prevention Systems (IPS) and Intrusion Detection Systems (IDS) play a significant role in defending against cyber-attacks, but unfortunately most of these systems are not very effective against DDoS and zero-day attacks. Recent research indicates that anomaly-based intrusion detection techniques are more effective at detecting intrusions, and they have gained a lot of attention from researchers in the past few years. In contrast, signature-based intrusion detection methods are comparatively easier to implement but are limited to attacks with known signatures.

Anomaly-based approaches built on Machine Learning and Deep Learning, both subfields of AI, can be used to distinguish abnormal from benign traffic. Telecom manufacturers are currently paying more attention to anomaly-based intrusion detection solutions because of their ability to detect cyber-attacks effectively and because of advances in computing power. Palo Alto Networks launched the first Machine Learning-based IDS (MLIDS) in 2020. However, the performance of an MLIDS depends on valuable datasets for training; given such data, numerous network attacks can be detected with high accuracy.

II. AVAILABLE DATASETS

In this research, publicly available DDoS attack datasets from 2000 to 2019 have been evaluated, and the need for a reliable and comprehensive dataset for validating and testing DDoS attack detection systems has been closely examined.

Numerous intrusion detection datasets are available for anomaly detection. Many of them were created with synthetic methods or simulations that cannot reflect real attack scenarios and were not generated through a complete protocol stack. The first known dataset is the Defense Advanced Research Projects Agency (DARPA) dataset, created by Lincoln Laboratory in 1998 [11]. Later, researchers created many other datasets, for example Knowledge Discovery and Data Mining (KDD99) (1998-99), DEFCON (2000-2002), Network Security Laboratory KDD (NSL-KDD) (2009), CAIDA (2000-2016), and Kyoto (2009). Each dataset has its own limitations. The main ones are low traffic diversity, limited data size, lack of real-time traffic, and non-realistic attack scenarios.

Authorized licensed use limited to: Wuhan University. Downloaded on April 19,2023 at 10:36:58 UTC from IEEE Xplore. Restrictions apply.

Sharafaldin et al. pointed out the weaknesses of these datasets in detail [4] and developed a new dataset called CICIDS2017. It was developed for intrusion detection and is the most comprehensive of the datasets created so far. Building on this initial work [4], they also created the DDoS2019 dataset [5]. For this dataset, a complete network topology was used to generate different kinds of attack traffic alongside background traffic over multiple days, and the captured traffic was analyzed with the flow-oriented CICFlowMeter. CICFlowMeter is a network traffic analyzer that can generate 80 statistical attributes, such as number of packets, packet length, duration, and total bytes, in both the forward and backward directions.
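To make the notion of per-direction flow attributes concrete, the following is a minimal sketch of how packets can be grouped into bidirectional flows and summarized. It is not CICFlowMeter's implementation: the `Packet` record, the `flow_features` helper, the choice of statistics, and the example addresses are all assumptions made for illustration, and flow timeouts are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified packet record (not a CICFlowMeter type)."""
    ts: float      # capture timestamp in seconds
    src: str       # source IP address
    sport: int     # source port
    dst: str       # destination IP address
    dport: int     # destination port
    proto: str     # transport protocol name
    length: int    # packet length in bytes


def flow_features(packets):
    """Group packets into bidirectional flows and compute basic statistics.

    The first packet seen for a 5-tuple defines the forward direction;
    packets with the reversed 5-tuple count as backward traffic.
    """
    flows = {}
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        fwd_key = (p.src, p.sport, p.dst, p.dport, p.proto)
        bwd_key = (p.dst, p.dport, p.src, p.sport, p.proto)
        if fwd_key in flows:
            key, direction = fwd_key, "fwd"
        elif bwd_key in flows:
            key, direction = bwd_key, "bwd"
        else:
            key, direction = fwd_key, "fwd"
            flows[key] = {
                "first_ts": p.ts, "last_ts": p.ts,
                "fwd_pkts": 0, "bwd_pkts": 0,
                "fwd_bytes": 0, "bwd_bytes": 0,
            }
        stats = flows[key]
        stats["last_ts"] = p.ts
        stats[direction + "_pkts"] += 1
        stats[direction + "_bytes"] += p.length

    # Derived attributes computed once per flow.
    for stats in flows.values():
        stats["duration"] = stats["last_ts"] - stats["first_ts"]
        total_pkts = stats["fwd_pkts"] + stats["bwd_pkts"]
        stats["mean_pkt_len"] = (stats["fwd_bytes"] + stats["bwd_bytes"]) / total_pkts
    return flows


# Made-up packets using documentation-range addresses.
packets = [
    Packet(0.0, "192.0.2.10", 40000, "192.0.2.20", 80, "TCP", 60),
    Packet(0.5, "192.0.2.20", 80, "192.0.2.10", 40000, "TCP", 1500),
    Packet(1.0, "192.0.2.10", 40000, "192.0.2.20", 80, "TCP", 40),
]
for key, stats in flow_features(packets).items():
    print(key, stats)
```

A real flow exporter additionally applies idle and activity timeouts and computes many more statistics per direction; the sketch only shows where the forward/backward split comes from.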

The authors in [4] simulated background traffic in a network consisting of 25 users and 5 protocols: FTP, SSH, HTTP, HTTPS, and email. We examined their DDoS attack cases in particular, with the aim of developing a new real-time DDoS dataset. The authors used LOIC to simulate DDoS attacks. The main deficiency of that dataset is that the raw network traffic is not real. An actual DDoS attack takes place when thousands of different attacks flood and overwhelm both bandwidth and services. Developing a real-time dataset means creating a real environment that imitates a wide range of client behavior, both malicious and benign, including thousands of known attacks generated with real stateful traffic and applications. Real-time scenarios with time-varying and dynamic conditions are also part of the creation process.

Most existing DDoS datasets have limitations, for example the dataset developed by Sharafaldin et al. [5]. In this project, a realistic DDoS dataset was developed with modern simulators (HOIC, Slowloris, DDoSIM, HULK, Goldeneye, Bonesi, Mirai Botnet, Tor Hammers), which reduces the need for a full network topology. Subsequently, we evaluated our new dataset. The evaluation identified the features that are crucial for detection and assessed the performance of available machine learning algorithms. In addition, Principal Component Analysis (PCA) was performed to reduce dimensionality. Using CRCFM2022 is a novel technique for developing realistic datasets. According to our in-depth analysis, this technique has not been used for any existing dataset so far.

This paper is organized as follows. Section III explains the testbed, Section IV presents the test scenarios, and Section V provides results and analysis. Lastly, Section VI concludes the paper.

III. TESTBED

Modern simulators (HOIC, Slowloris, DDoSIM, Mass, Goldeneye, Bonesi, Mirai Botnet, Tor Hammer) are used to create real-world attack simulations. These simulators are robust and simple to use, and they produce realistic application traffic as well as attacks for testing the performance, versatility, and security of application-oriented networks. They are regarded in industry as among the most powerful test systems for OSI layers 3-7, and they imitate real application traffic. They have separate libraries of thousands of practical applications and attack vectors, which are updated daily to support load and functional testing with unparalleled versatility.

In DDoS attacks, attackers try to overwhelm servers, which consumes a large amount of network bandwidth through ingress and egress traffic. By using these test systems, the network architecture was simplified while a full network topology was still created, as recommended for developing an effective dataset. Notably, these simulators can imitate a hundred thousand attacks per second with spoofed MAC and IP addresses and with different protocols and traffic types. The testbed permits load balancing between attack and benign traffic. With these options, the test system produces real-time traffic through full protocol stacks. Fig. 1 presents the newly developed testbed for DDoS attack scenarios.

Two network groups are used in this test system: the attacker group (172.16.221.108) and the victim group. Simplified configurations were used in the victim network, such as a Linux server (172.16.221.133). Likewise, Windows servers (172.16.221.133) with PCs (172.16.x.x) were used. The Linux server provides SMTP, DNS, and web services, whereas the Windows server provides only web services. The attack group is connected to the victim network over a 10 Gbps fiber link, which allows attack scenarios with a maximum load of approximately 10 Gbps. Outgoing and incoming traffic is captured on a mirror port with tcpdump. Table I lists the workstations, servers, and firewalls with their operating systems and associated private and public IPs on the training and testing days.

TABLE I

IPS AND TARGET NETWORK OPERATING SYSTEM

| Machine | OS | IPs |
|---|---|---|
| Server | Ubuntu 22.04 LTS (Web Server) | 172.16.221.133 |
| Firewall | pfSense | 10.10.23.11 |
| PCs (Training Day) | Win 7 Home Basic | 172.16.221.134 |
| | Win 8.1 Pro | 172.16.221.135 |
| | Win 10 Education | 172.16.221.136 |
| PCs (Testing Day) | Win 7 Home Basic | 172.16.221.137 |
| | Win 8.1 Pro | 172.16.221.138 |
| | Win 10 Education | 172.16.221.139 |

A third party executed the attack families on the training and testing days. The victim network comprises one server (web server), two switches, one firewall, and four PCs. One port on the main switch of the victim network is configured as a mirror port dedicated to capturing all incoming and outgoing network traffic.


Fig. 1. Testbed

IV. TEST SCENARIOS

The test considers three types of DDoS attacks: protocol DDoS (PDDoS), volumetric DDoS (VDDoS), and application-layer attacks. This research focuses on the PDDoS and VDDoS scenarios. PDDoS is a large-scale, protocol-based DDoS attack designed to cause state exhaustion. VDDoS is likewise a large-scale, bandwidth-intensive attack intended to cause network congestion. VDDoS covers layer 3 and layer 4 flood attacks. These include typical ICMP flooding (type 0, 4-18) and mixed flooding that uses multiple protocols such as UDP, ICMP, raw IP, and TCP. PDDoS covers layer-4, connection-oriented TCP SYN, ACK, and RST flooding attacks. All of these attacks were generated to develop the DDoS dataset, with special attention paid to fundamental flood attacks. Benign traffic was imitated with mixed traffic from the profiles of 80 clients (covering HTTP, HTTPS, FTP, SSH, SMTP, RTSP, POP3, BitTorrent, Telnet, SIP, and IMAP).

A. Test Procedures

The tests were executed on the testbed described in Section III above. Each traffic scenario is generated from the simulators toward the victim servers, and tcpdump collects the data. Raw data is captured in PCAP format. The simulators are used to create multiple attackers with spoofed IP addresses and distinct MAC addresses. In this work, 5 days of simulations were run for the DDoS attacks using the LOIC simulation tool, yielding roughly 10-70 GB of raw data. To launch realistic attacks, very high loads combined with rapid attacks were used. The high load exhausted the bandwidth in both the inbound and outbound directions, and all communication between the attacker and victim networks was captured through the mirror port. Test runs of almost thirty minutes were used to keep the raw data sizes comparable; further details are recorded in Table II. Figure 2 shows a screen capture of the benign mixed traffic from the network, in which we imitated 80 clients with different traffic profiles such as HTTP, SMTP, SSH, and so on. 1 2

Fig. 2. Benign Traffic

V. ANALYSIS

A testbed was created with multiple simulators (including HOIC, Slowloris, DDoSIM, HULK, Goldeneye, Bonesi, Mirai Botnet, Tor Hammers). The dataset was assessed against the recommendations provided in [10] to make it a valuable dataset. To improve the quality of the DDoS attack dataset, traffic was produced from the attackers to the victim network using a streamlined yet full network topology, as displayed in Fig. 1. Moreover, real traffic was produced with the simulators for a targeted network with different network components, comprising servers, PCs, firewalls, switches, and routers. Incoming and outgoing traffic was then captured from the mirror port with tcpdump for the duration of each test. The captured traffic shows the connections between network components inside the targeted network. Using the simulators' capabilities, real traffic incorporating different protocols, including SSH, was produced. Similarly, different attack vectors as well as normal traffic were produced, as shown in Fig. 3, which shows the variety of benign traffic and attacks.

1 Demo Video https://bit.ly/3rGfT1R.

2 Dataset Link https://github.com/CRC-Center/CRCDDoS2022.

TABLE II

DATASET DESCRIPTION

| Attack Type | Attack Time | File Size | Target Device | Avg. Load | No. of Attack Addresses |
|---|---|---|---|---|---|
| ICMP flood | 24 hours | 73.65 GB | Ubuntu/Windows Server | 1 Gbps | 1500 MAC addresses with 1500 spoofed IP addresses |
| UDP Port 53 | 2 hours | 5.20 GB | Ubuntu Server | 1 Gbps | 1100 MAC addresses with 1100 spoofed IP addresses |
| SYN Port 25 | 2 hours | 1.1 GB | Ubuntu Server | 500 Mbps | 1100 MAC addresses with 1100 spoofed IP addresses |
| SYN Port 80 | 2 hours | 4.3 GB | Ubuntu Server | 500 Mbps | 1100 MAC addresses with 1100 spoofed IP addresses |
| SYN Port 443 | 2 hours | 4.1 GB | Ubuntu Server | 500 Mbps | 1100 MAC addresses with 1100 spoofed IP addresses |
| Bots | 5 hours | 23 GB | Ubuntu/Windows Server | 800 Mbps | 500 MAC addresses with 500 spoofed IP addresses |
| Port Map | 3 hours | 1 GB | Ubuntu/Windows Server | 200 Mbps | 500 MAC addresses with 500 spoofed IP addresses |
| TFTP | 2 hours | 3 GB | Ubuntu/Windows Server | 400 Mbps | 500 MAC addresses with 500 spoofed IP addresses |
| Benign Traffic | 20 hours | 21.5 GB | Ubuntu/Windows Server | 1 Gbps | 80 MAC addresses with 80 profiles |

Fig. 3. PCA Analysis

After that, the features were extracted from the raw data using CICFlowMeter. The labeled dataset includes timestamps and the different features that describe each traffic flow. The result is a valuable DDoS dataset with over 5.6 million flows. With CICFlowMeter, 86 features were extracted from the raw PCAP data. For details of the extracted features, we refer the reader to [6]. We used 71 of the 86 features for the initial analysis. The full labeled dataset contains more than 5.6 million records. For any anomaly-based approach, a dataset of this size is adequate for training and testing a model. The dataset comprises 30.57% normal and 70.24% DDoS-labeled flows. Figures 4-9 present the distribution of the different DDoS attacks and normal traffic.

For the analysis, the Scikit-learn library was used with the Python programming language. Because a large number of features was involved, dimensionality reduction was performed with Principal Component Analysis (PCA) [8]. PCA is a very common technique for reducing the dimensionality of a large feature set to a smaller one that still contains most of the information in the large set, which makes the data easier to visualize and interpret. In doing so, it creates new principal components that hold most of the information. Figure 3 presents the PCA analysis. We observed that 27 principal components hold 96% of the information. Figure 10 presents the ranking of the 27 important features from the PCA analysis.
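The PCA step described here can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the feature matrix is synthetic random data standing in for the 86 flow features, the standardization step and the random seed are choices made for the example, and the number of components it selects has no relation to the 27 reported for CRCDDoS2022.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a flow-feature matrix: 2000 rows, 86 columns,
# generated from 10 latent factors plus noise (assumption for illustration).
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 10))
mixing = rng.normal(size=(10, 86))
X = latent @ mixing + 0.5 * rng.normal(size=(2000, 86))

# PCA is scale-sensitive, so the features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components asks scikit-learn to keep the smallest number of
# components whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.96, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

retained = pca.explained_variance_ratio_.sum()
print("components kept:", pca.n_components_)
print("variance retained:", round(float(retained), 4))
```

Passing the target fraction directly avoids picking the number of components by hand; on the real dataset the same call would report how many components are needed to retain 96% of the variance.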

We used our dataset to measure the performance of several common ML algorithms: Quadratic Discriminant Analysis (QDA), Random Forest (RF), Gaussian Naïve Bayes (GNB), Iterative Dichotomiser 3 (ID3), and Multi-layer Perceptron (MLP, a class of feed-forward ANN) [9], [10].
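A comparison of this kind might be set up in scikit-learn roughly as follows. The sketch makes several assumptions: the data is synthetic (generated with `make_classification`), all hyperparameters are arbitrary, and because scikit-learn does not ship an ID3 implementation, an entropy-based `DecisionTreeClassifier` is used as a stand-in for it. The timings and scores it prints say nothing about the results reported in this paper.

```python
import time

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for benign vs. DDoS flows (assumption).
X, y = make_classification(
    n_samples=4000, n_features=27, n_informative=10, n_redundant=0,
    weights=[0.3, 0.7], class_sep=2.0, random_state=0,
)
# 50/50 split, stratified so both halves keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

models = {
    "QDA": QuadraticDiscriminantAnalysis(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "GNB": GaussianNB(),
    # Stand-in for ID3: an entropy-criterion CART tree, not true ID3.
    "ID3 (stand-in)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    results[name] = (model.score(X_test, y_test), fit_seconds)

for name, (accuracy, fit_seconds) in results.items():
    print(f"{name}: accuracy={accuracy:.4f}, fit time={fit_seconds:.3f} s")
```

Recording fit time next to accuracy makes it possible to compare models that score almost identically but differ widely in cost.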

Fig. 4. Normal Traffic vs Day1 DDoS Attacks

Fig. 5. Normal Traffic vs Day2 DDoS Attacks

Fig. 6. Normal Traffic vs Day3 DDoS Attacks

Fig. 7. Normal Traffic vs Day4 DDoS Attacks

Fig. 8. Normal Traffic vs Day5 DDoS Attacks

Fig. 9. Normal vs DDoS Attacks

At first, the performance of the above ML techniques was analyzed using the full dataset with 86 features, and afterward using a dataset with 27 features. Similar performance was obtained with the 27 features selected through the PCA analysis. We also observed that, with a larger number of test samples, all of these algorithms give equivalent results, with precision of up to 98.5-100 percent. A typical question is how much data should be used for testing. The answer depends only on the dataset and on the complexity of the algorithm. Most commonly, a dataset is split in a ratio of 80% to 20% for training and testing, respectively. This is a consequence of the restricted size of such datasets. Heuristic techniques can be used to find the most appropriate size for the training set.
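One simple heuristic for choosing a training-set size is to hold out a fixed test set and observe how the test score changes as the training fraction grows. The sketch below assumes synthetic data, a Gaussian Naïve Bayes model, and an arbitrary list of fractions; it is an illustration of the idea rather than the procedure used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for labeled flows (assumption for illustration).
X, y = make_classification(
    n_samples=5000, n_features=27, n_informative=10, n_redundant=0,
    class_sep=2.0, random_state=0,
)

# Fixed 20% hold-out so every training size is scored on the same test set.
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

scores = {}
for fraction in (0.1, 0.25, 0.5, 0.75, 1.0):
    if fraction < 1.0:
        # Draw a stratified subsample of the training pool.
        X_train, _, y_train, _ = train_test_split(
            X_pool, y_pool, train_size=fraction, stratify=y_pool, random_state=0
        )
    else:
        X_train, y_train = X_pool, y_pool
    model = GaussianNB().fit(X_train, y_train)
    scores[fraction] = model.score(X_test, y_test)

for fraction, score in scores.items():
    print(f"training fraction {fraction:.2f}: test accuracy {score:.4f}")
```

When the score stops improving as the fraction grows, a smaller training set is sufficient and more data can be reserved for testing.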

Observable changes were noted when the size of the training dataset was decreased. To measure the performance of the ML algorithms, standard evaluation metrics were used: Precision (Pr), Recall (Rc), and the F1 measure (F1-score). These were computed from the Confusion Matrix (CM) on our dataset for each ML algorithm. A CM summarizes the performance of a model on a dataset with known labels. Table III reports the results with 50 percent training sets for each model; the training precision is extremely high because of the very large training dataset. Accuracy is the fraction of predictions that a model makes correctly, so more correct predictions mean higher accuracy. Accuracy can be calculated as follows:

$$\mathrm{Accuracy} = \frac{\text{No. of correct predictions}}{\text{Total number of predictions}} \tag{1}$$

The next parameter, Precision, is the ratio of true positives (TP) to the sum of true positives and false positives (FP). A better model shows higher precision. Precision is calculated as below:

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \tag{2}$$

Another parameter is Recall, which is the sensitivity of a system. It is calculated from the true positives and the false negatives (FN) as below:

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{3}$$

The final parameter is the F1-score, which is calculated from Recall and Precision. It is the harmonic mean of Precision and Recall and lies between 0 and 1. A value of 1 indicates that Recall and Precision are both perfect, while 0 indicates that either Recall or Precision is 0. The F1-score is calculated as:

$$F_1\text{-Score} = \frac{2}{(\mathrm{Recall})^{-1} + (\mathrm{Precision})^{-1}} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$

Fig. 10. Feature Ranking
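As a worked illustration of Eqs. (1)-(3) and the F1-score, the following sketch computes all four metrics from the entries of a binary confusion matrix. The helper name `classification_metrics` and the example counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 90 attacks detected, 10 false alarms,
# 30 attacks missed, 70 benign flows correctly passed.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 160/200 = 0.8, precision is 90/100 = 0.9, recall is 90/120 = 0.75, and the F1-score is 1.35/1.65, about 0.818, which sits between precision and recall but closer to the lower of the two, as expected of a harmonic mean.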

Although the performance of all the techniques is very close, the execution time makes a difference. The detection accuracy of GNB and ID3 is almost 99.5% to 100%; however, the best performance per unit of execution time is given by GNB, as presented in Table III.

GNB takes just 12.70 s, while MLP takes the longest time, 2906.34 s. Based on this analysis, either the GNB or the ID3 model can be used for real-time DDoS detection, since they need the least time while achieving very high accuracy. For example, the training accuracy of GNB was almost 0.99585, which is reflected in precise detection numbers, as can be seen from the Confusion Matrix for GNB. GNB correctly classified all DDoS attacks and produced only around 0.1% false positives in DDoS detection (weighted average performance), with fifty percent of the 5.6 million records used for training and fifty percent for testing. The evaluation metrics used here are those defined in Eqs. (1)-(3) together with the F1-score.

VI. CONCLUSION

A unique real dataset with more than 5.6 million data samples for DDoS was generated with the help of multiple simulators in a network. The sample size is sufficient for any anomaly detection model to be trained efficiently to detect the aforementioned attacks. The attack diversity in the dataset was created by thousands of spoofed attackers and 10 Gbps traffic. The research community will be able to access the labeled dataset (CRCDDoS2022) and the raw traffic created in this project once it is finished. Because the dataset has a large dimension, PCA was performed, and the dimensionality was reduced from 86 to 27 features while keeping 96% of the information in the features used. The analysis of the commonly used ML algorithms indicated that, if a quality dataset is used for training and testing, the precision achieved may vary. The performance of all ML algorithms used in this project can be compared using the Rc, Pr, and F1 parameters. However, the time required for execution can differ. The lowest execution time was given by GNB and the highest by MLP. Using multiple simulators to develop this real dataset is a novel approach. These simulators are recognized by industry for validation and testing and, so far, have not been used to create any available dataset. The future goal is to extend this work to create a realistic IDS system. 3 4

VII. ACKNOWLEDGEMENT

This work is supported in part by the Hubei Province Key Research and Development Program (2021BAA027).

REFERENCES

[1] J. Li, S. Kamin, G. Zheng, F. Neubrech, S. Zhang, and N. Liu, "Addressable metasurfaces for dynamic holography and optical information encryption," Science Advances, vol. 4, no. 6, eaar6768, 2018.

[2] H. J. Hadi, S. M. Sajjad, and K. un Nisa, "BoDMitM: Botnet detection and mitigation system for home router base on MUD," in 2019 International Conference on Frontiers of Information Technology (FIT), IEEE, 2019, pp. 139-1394.

[3] K. M. Prasad, A. R. M. Reddy, and K. V. Rao, "DoS and DDoS attacks: defense, detection and traceback mechanisms - a survey," Global Journal of Computer Science and Technology, 2014.

[4] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, "Toward generating a new intrusion detection dataset and intrusion traffic characterization," in ICISSP, vol. 1, pp. 108-116, 2018.

[5] I. Sharafaldin, A. H. Lashkari, S. Hakak, and A. A. Ghorbani, "Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy," in 2019 International Carnahan Conference on Security Technology (ICCST), IEEE, 2019, pp. 1-8.

[6] A. H. Lashkari, G. Draper-Gil, M. S. I. Mamun, and A. A. Ghorbani, "Characterization of Tor traffic using time based features," in ICISSP, 2017, pp. 253-262.

[7] O. Kramer, "Scikit-learn," in Machine Learning for Evolution Strategies, Springer, 2016, pp. 45-53.

[8] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, G. Casella, S. Fienberg, and I. Olkin, Eds. Springer, 2017.

[9] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.

[10] L. Breiman, "Random forests," Machine Learning, vol. 45, Oct. 2001.

[11] "2000 DARPA Intrusion Detection Scenario Specific Datasets MIT Lincoln Laboratory", Ll.mit.edu, 2022. [Online]. Available: https://www.ll.mit.edu/rd/datasets/2000-darpa-intrusiondetection-scenario-specific-datasets. [Accessed: 11-Apr-2022]

3 Demo Video https://bit.ly/3rGfT1R.

4 Dataset Link https://github.com/CRC-Center/CRCDDoS2022.
