
Chameleon: Optimized Feature Selection using Particle Swarm Optimization and Ensemble Methods for Network Anomaly Detection

March 2022
Computers & Security 117(4):102684


DOI:10.1016/j.cose.2022.102684

Authors:

Aniss Chohra

Concordia University Montreal

Paria Shirani

University of Ottawa

Elmouatez Karbab

Concordia University Montreal

Mourad Debbabi

Concordia University Montreal



CHAMELEON: Optimized Feature Selection using Particle Swarm

Optimization and Ensemble Methods for Network Anomaly Detection

Aniss Chohraᵃ,∗, Paria Shiraniᵇ,∗∗, ElMouatez Billah Karbabᵃ and Mourad Debbabiᵃ

ᵃ Security Research Centre, Gina Cody School of Engineering and Computer Science, Concordia University, Montréal, Québec, Canada

ᵇ Department of Computer Science, Ryerson University, Toronto, Ontario, Canada

ARTICLE INFO

Keywords:

Feature Selection

Swarm Intelligence

Particle Swarm Optimization (PSO)

Ensemble Methods

Internet of Things (IoT)

Network Anomaly Detection

Deep Learning

ABSTRACT

In this paper, we propose an optimization approach by leveraging swarm intelligence and ensemble

methods to solve the non-deterministic feature selection problem. The proposed approach is validated

on two benchmark datasets, namely, NSL-KDD and UNSW-NB15, in addition to a third dataset,

called IoT-Zeek dataset, which consists of Zeek network-based intrusion detection connection logs.

We build the IoT-Zeek dataset by employing ensemble classiﬁcation and deep learning models using

publicly available malicious and benign threat intelligence on the Zeek connection logs of IoT devices.

Moreover, we deploy and validate a deep learning-based anomaly detection model using autoencoders

on each of the aforementioned datasets by utilizing the selected features obtained from the proposed

optimization approach. The obtained results demonstrate that our approach outperforms the existing

state-of-the-art machine learning models in terms of f1 score results, with a 92.092% f1 score on the NSL-KDD

dataset, a 92.904% f1 score on the UNSW-NB15 dataset, and a 97.302% f1 score on the IoT-Zeek dataset.

1. Introduction

Due to the emerging technologies, the large connectivity

between diﬀerent devices in diﬀerent ecosystems, and the

increasing rate of cyberattacks (e.g., IoT attacks increased

700% in the last two years¹), security analysis of network

data is an absolute need. However, providing accurate

and efficient threat detection solutions on large volumes of

data becomes more challenging. On the other hand, during

the last decade, machine learning and deep learning tech-

niques have attracted tremendous attention in many ﬁelds

(e.g., anomaly detection, vulnerability assessment, natural

language processing, stock market, and weather forecast).

Therefore, training eﬃcient and scalable machine learning

and deep learning-based threat detection models has become a

task of paramount importance.

There are two common and known problems that need to

be addressed to provide eﬃcient, accurate and scalable mod-

els. (i) Selecting the appropriate setting of hyper-parameters

for the model to be trained: this task generally falls in the

non-deterministic problems class, as it might have several

solutions that give the same accuracy results; meaning that

this kind of problem accepts at least two possible solutions

(optimal solutions). (ii) Selecting the appropriate set of features that best define the final problem. There exist many features in most domains, which makes it time-consuming to train and validate the models. Moreover, some of those features are irrelevant due to the presence of redundancy, sparsity, or lack of correlation to the problem to be solved. Therefore, filtering out the irrelevant features has become a widely adopted procedure before any model training and experimentation.

∗ Corresponding author.

∗∗ Part of this work has been done during the postdoctoral fellowship of the author at Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

a_chohra@encs.concordia.ca (Aniss Chohra); paria.shirani@ryerson.ca (Paria Shirani); e_karbab@encs.concordia.ca (ElMouatez Billah Karbab); debbabi@ciise.concordia.ca (Mourad Debbabi)

ORCID(s): 0000-0003-1823-713X (Aniss Chohra); 0000-0001-5592-1518 (Paria Shirani); 0000-0003-1293-8314 (ElMouatez Billah Karbab); 0000-0003-3015-3043 (Mourad Debbabi)

¹ https://www.darkreading.com/endpoint/iot-specific-malware-infections-jumped-700-amid-pandemic

There is a palette of techniques that have been proposed

to select the most relevant features. The most common

approach is the use of ensemble methods (e.g., Sagi and

Rokach (2018); Sheikhpour et al. (2017); Lazar et al. (2012))

due to the fact that these methods provide easier explana-

tion of the variables compared to other techniques. How-

ever, sometimes it becomes diﬃcult to know which features

are given more importance than others by the model Gomes

et al. (2017). In addition, ensemble learning techniques

combine multiple models all together in order to improve the

overall predictive capability and to decrease the overﬁtting

as much as possible.

There exist state-of-the-art techniques that are proposed to deal with the non-deterministic aspect of the feature selection problem. These works generally use optimization algorithms to find optimal solutions and make decisions according to a certain objective function. For instance, Ahmad et al. (2018) propose a feature selection approach using artificial bee colony (ABC), and Dong et al. (2018) incorporate a hybrid genetic algorithm with granular information. However, these approaches do not explore the usage of their solutions on other types of datasets (e.g., intrusion detection systems (IDS)).

On the other hand, autoencoders (Liu et al. (2017); Jia

and Zhang (2018)) are a type of neural network that

aims to reconstruct a given input as an output with the

least possible changes. Autoencoders are widely used for

anomaly detection Chalapathy and Chawla (2019); Ahmed

et al. (2016); Kwon et al. (2019); Xie et al. (2011) due to their

ability to better represent (compress) the data to a latent-

Aniss Chohra et al.: Preprint submitted to Elsevier Page 1 of 19

Optimized Feature Selection for Network Anomaly Detection

representation (bottleneck), which consists of a reduced representation of the input. In addition, their ability to reconstruct the input makes them more suitable for the anomaly detection task; anomalies can be detected by comparing the reconstructed output with the input, and then flagging the input as an anomaly if the output deviates from it.
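As a minimal illustration of the reconstruction-based detection described above, the following Python sketch flags a record as anomalous when its mean squared reconstruction error exceeds a threshold. The `toy_reconstruct` stand-in (which simply clips values to an assumed "normal" range), the use of a threshold, its value, and all function names are assumptions made for illustration; they are not the autoencoder or the settings used in this work.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def reconstruction_error(x: Vector, x_hat: Vector) -> float:
    """Mean squared error between an input and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def flag_anomalies(
    records: List[Vector],
    reconstruct: Callable[[Vector], Vector],
    threshold: float,
) -> List[bool]:
    """Return True for every record whose reconstruction error exceeds the threshold."""
    return [reconstruction_error(x, reconstruct(x)) > threshold for x in records]


def toy_reconstruct(x: Vector) -> List[float]:
    """Hypothetical stand-in for a trained autoencoder: it can only
    reproduce values inside [0, 1], the range assumed for normal traffic."""
    return [min(max(v, 0.0), 1.0) for v in x]


records = [[0.2, 0.5, 0.9], [0.1, 3.0, 0.4]]
flags = flag_anomalies(records, toy_reconstruct, threshold=0.1)
```

In this toy setting, the first record is reproduced exactly and is not flagged, while the second record contains a value outside the assumed normal range and is flagged.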

Moreover, there exist different works that are proposed

to detect anomalies using deep learning models, e.g., Du

et al. (2017); Merrill and Eskandarian (2020); Hwang et al.

(2020); Dutta et al. (2020); Xiong et al. (2020); Doan and

Zhang (2020); Chalapathy et al. (2020). However, to the best

of our knowledge, none of them explores the eﬀects of fea-

ture selection before applying the anomaly detection model.

Problem Statement: Using all the features present in the

input data to train autoencoders can be quite troublesome

and time consuming, especially where the input data con-

tains millions of records, making the experimentation and

model engineering more complicated and diﬃcult. More-

over, autoencoders focus mainly on feature engineering and

extraction rather than feature selection. In other words, by

transforming the input data into a compressed representa-

tion, autoencoders are able to reduce the dimensionality of

the data and learn a smaller representation. In the case of

a large number of features, explaining and understanding the

compressed data is diﬃcult, whilst feature selection identi-

ﬁes the most useful and relevant features that best describe

and deﬁne a given ground truth variable Hartmann (2004).

Generally, during the feature selection process, choos-

ing the appropriate set of hyper-parameters is quite chal-

lenging. This is due to the fact that it is a non-deterministic

problem, which can have multiple optimal solutions; all of

them would give the same performance and accuracy results.

Thus, even after exhaustive experimentation, there is no ev-

idence to prove that (i) all the possible solutions have been

explored, and (ii) a particular solution is the best one.

Key Idea: In this context, we propose a novel approach,

called CHAMELEON, which combines both swarm intelli-

gence and ensemble learning techniques to select the optimal

settings (hyper-parameter selection for the ensemble models

as well as selection of the most relevant features for each

individual dataset) for the feature selection task. The proposed

approach uses ensemble learning classiﬁers as a ﬁtness and

evaluation function for each individual/particle within the

population/swarm. This population aims to converge to the

optimal solutions in an iterative process, where in each it-

eration, each individual tries to move closer to the optimal

solutions. Afterwards, we use the selected features given by

the optimal ensemble model to construct an anomaly detec-

tion autoencoder; we iteratively improve the model until it

outperforms the state-of-the-art models.

Contributions: The main contributions of our work are

summarized as follows:

• Novel feature selection for network datasets: We propose

a feature selection approach for network datasets

that leverages both swarm intelligence and ensemble

methods to select the most relevant features. The en-

semble methods are used as a ﬁtness function within

the optimization approach in order to leverage their

ability to better interpret and select the independent

features.

• Training time improvement: We employ the selected

features obtained from the optimization step and de-

ploy deep learning models for network anomaly detec-

tion. The feature selection process considerably im-

proves the training time compared to the case where

all features are used.

• Malicious and benign dataset generation: We set up

an environment and generate a dataset called IoT-Zeek

dataset from PCAPS and connection logs using Zeek

NIDS Team (2018). Then, we introduce an ensem-

ble model leveraging classical machine learning and

deep learning classiﬁers in order to learn malicious be-

haviour on the generated network traﬃc and classify

network logs into malicious or benign connections.

• Evaluation: We evaluate our proposed approach on

diﬀerent datasets (i.e., IoT dataset: IoT-Zeek, and

non-IoT datasets: NSL-KDD and UNSW-NB15) and

demonstrate its efficiency and performance. In addition,

experiments performed on the selected features

obtained from the optimal solution for each dataset in-

dicate that our proposed model outperforms existing

works.

This paper is organized as follows. The most rele-

vant state-of-the-art works are discussed in section 2. An

overview of the proposed approach along with the method-

ologies are presented in section 3. The evaluation results are

provided in section 4. The limitations of our approach along

with the concluding remarks are presented in section 5.

2. Related Work

In this section, we present the most relevant and impor-

tant works that have been proposed for: (i) feature selec-

tion using optimization algorithms and (ii) anomaly detec-

tion and maliciousness ﬁngerprinting using machine learn-

ing and deep learning models.

2.1. Feature Selection Using Optimization

Ahmad et al. (2018) propose a feature selection approach

using Artificial Bee Colony (ABC). In addition, they integrate

a Kalman filter² alongside the Hadoop ecosystem³ for

noise removal. The system is validated on ten datasets and

compared with swarm intelligence approaches. However,

the authors have not applied their approach on IDS datasets.

Dong et al. (2018) propose a technique for feature se-

lection, which incorporates a hybrid genetic algorithm with

² http://web.mit.edu/kirtley/kirtley/binlustuff/literature/control/Kalman%20filter.pdf

³ https://hadoop.apache.org/


granular information. This technique is tested on 11 bench-

mark ﬁnancial datasets and has been compared with cer-

tain state-of-the-art techniques. The obtained results demon-

strate that it achieves high classiﬁcation accuracy. However,

their work does not explore the usage of the approach on

other types of datasets (e.g., IDS and network datasets).

In Xue et al. (2012), a novel feature selection approach is

proposed for classiﬁcation, where the feature selection task

is considered as a non-deterministic problem. The authors

investigated two types of multi-objective particle swarm optimization

(PSO) algorithms. The first one leverages the concept

of non-dominated sorting in the feature selection problem,

whilst the second one introduces more evolutionary

concepts (mutation and crossover) to search for better opti-

mal solutions. These two algorithms were then compared

with two standard feature selection techniques and then val-

idated on twelve benchmark datasets. However, they did not

explore the usage of more complex ﬁtness functions.

A novel approach for feature selection is proposed in Liu

et al. (2011), which combines multi-swarm particle swarm

optimization (MSPSO) and support vector machines (SVM)

as fitness function, with the f1 score being the fitness value. The

goal was to execute both kernel optimization and feature se-

lection simultaneously in order to get better generalization.

The proposed approach was then compared with state-of-the-art feature selection algorithms using PSO, genetic algorithm (GA), and grid search, using ten UCI (UC Irvine)⁴ machine learning benchmark datasets for validation. The evaluation results show that their novel technique outperforms the

three aforementioned techniques. However, the proposed al-

gorithm is only speciﬁc to the datasets used for validation

and has not been tested on the network IDS datasets.

In Ghamisi and Benediktsson (2014), the authors pro-

posed a feature selection approach which combines both

GA and PSO algorithms, where SVMs are used as ﬁtness

function and the accuracy metric as ﬁtness value. The

proposed technique was validated on Indian Pines Spectral

dataset NASA AVIRIS Sensor (2021) and the results show

that the approach can select the most relevant features that

allow higher accuracy results for classiﬁcation. However,

the authors did not present an exhaustive study on benchmark datasets nor a comparative study with state-of-the-art techniques. Moreover, the proposed solution is limited to the utilized dataset.

A novel approach for feature selection that combines a genetic algorithm with neural networks (HGA-NN) is introduced in Oreski and Oreski (2014). The approach was applied to a real-world credit dataset collected from the Croatian Bank, and furthermore evaluated on a benchmark credit dataset selected from the UCI database. Finally, this technique was compared to existing classification works in terms of accuracy results and was shown to outperform them. However, we find that this technique focuses more on the accuracy than on the f1 score, and has only been applied to UCI datasets.

⁴ https://archive.ics.uci.edu/ml/datasets.php

2.2. Deep Learning and Anomaly Detection

In Tama et al. (2020), the authors present a novel anomaly detection system for web applications by proposing a stacked ensemble that combines other ensemble models (e.g., random forests, XGBoost). Four datasets (CSIC-2010v2, CICIDS-2017, NSL-KDD, UNSW-NB15) were used for the validation of their approach. The obtained results show that the proposed stacked model outperforms existing web attack detection solutions in terms of accuracy and false positive rate (FPR) metrics. However, the authors have not performed a scalability and complexity study of their approach, especially for the two large datasets (UNSW-NB15 and CICIDS-2017). Nkenyereye et al. (2021) also proposed a stacking-based model for anomaly-based intrusion detection systems, where the base learners/models are deep neural networks (DNN). Their approach is then validated on benchmark datasets (NSL-KDD, UNSW-NB15, and CICIDS-2017) and evaluated using several metrics including the accuracy, false positive rate, and Matthews Correlation Coefficient. The obtained results show that their model outperforms a simple DNN-based anomaly model in addition to some state-of-the-art techniques (by achieving 89.97%, 92.83%, and 99.65% on the three aforementioned benchmark datasets, respectively). However, they have not performed a scalability study of their model on these datasets.

In Hamamoto et al. (2018), the authors present a novel

approach for anomaly detection which combines both ge-

netic algorithm and fuzzy logic. More speciﬁcally, the ge-

netic algorithm is deployed in order to better represent ﬁn-

gerprints of network segments using network ﬂow data. This

also allows predicting network traffic behaviours for specific

and pre-deﬁned time windows. Then, fuzzy logic is used to

decide whether there are some anomalous behaviours within

those time-windows. Their approach was validated and eval-

uated on real-world network traﬃc data and it has proven to

be eﬀective by achieving 96.53% of accuracy and 0.56% of

false positives rate.

Ma et al. (2021) proposed a novel approach for anomaly

detection on network traﬃc data, called SVM-L, which com-

bines both SVM and Linear Discriminant Analysis (LDA).

More speciﬁcally, URLs from the data are used as input and

converted into vector format using natural language process-

ing (NLP) and statistical techniques. Then, these vectors are

fed to the SVM model in order to classify them into anoma-

lies or normal. In addition, the authors used an optimization

algorithm in order to optimize the hyper-parameters of the

SVM classiﬁer. The validation results of the SVM-L model

show that it achieves 99% accuracy on the tested datasets.

There exist several solutions (e.g., Alsaheel et al. (2021),

Shen and Stringhini (2019)) that improve the results of the

maliciousness segregation using advanced machine learning

and NLP techniques on log ﬁles. For instance, Shen and

Stringhini (2019) propose attack2vec to detect emerging net-

work attacks by leveraging dynamic word embeddings tech-

niques. Similar to NLP word embeddings, their approach

produces a dense representation of the security events while


considering the time factor. Moreover, in Alsaheel et al.

(2021), the authors propose Atlas, a framework for attack in-

vestigation that leverages NLP and deep learning techniques

to segregate attacks and non-attacks using logs as input. At-

las begins with processing the logs and building a causal de-

pendency graph between the events found in the logs. This

graph is augmented using NLP techniques and used to train

a sequence-based model that represents the attack semantics.

The produced models help the cyber analyst identify key at-

tack steps that share similarities with previous patterns. On

the contrary, our proposed IoT real-world dataset genera-

tion (presented in subsection 3.4) ﬁngerprints malicious logs

from the IoT network traﬃc data by leveraging an ensem-

ble model constructed using several models/classiﬁers (e.g.,

Random Forests, XGBoost, CatBoost, NN, and CNN).

In Roy and Singh (2021), the authors present a study of

diﬀerent anomaly detection classiﬁers before and after ap-

plying feature selection. More speciﬁcally, the authors com-

pare diﬀerent machine learning classiﬁers by training each

model twice. In the ﬁrst iteration, they use all the exist-

ing features from the dataset. During the second iteration,

they ﬁrst tune the classiﬁer with several feature selection al-

gorithms; then they select the feature selection algorithm

which gives the best accuracy results, and use the selected

features with the same classiﬁer as for the ﬁrst training it-

eration. The reported results show that the J84 classiﬁer

achieves the highest accuracy.

The authors of Mahalakshmi et al. (2021) applied a convolutional neural network (CNN) model to detect anomalies efficiently. The obtained results show that their proposed CNN model achieves an accuracy of 93.5%. However, the authors have not compared their work with any other state-of-the-art approaches.

In Min et al. (2021), the authors introduced a novel

network anomaly detection technique, called memory-

augmented deep auto-encoder (MemAE). Autoencoder was

used to reconstruct the behavior of abnormal samples that

look close to normal ones; thus, the authors are solving the

problem of over-generalization, which occurs with abnormal

samples on autoencoders.

Roy et al. (2022) propose a lightweight intrusion detec-

tion system, called B-Stacking, based on supervised ma-

chine learning. A series of feature transformations, dimen-

sionality reduction and feature selection methods are applied

to produce the learning features. Afterwards, the authors

propose B-Stacking, a machine learning ensemble that uses

K-Nearest Neighbors (KNN), Random Forest, and XGBoost

to detect network anomalies. The system is claimed to be

lightweight and targets IoT devices; however, the experiments

have been carried out on an Intel Core i5-9400F CPU 2.90

GHz notebook with 8GB of RAM and the system consumes

3.4% of the RAM and 1.5% to 2.9% of the CPU in this high-

end notebook machine, which is considered highly demand-

ing for an IoT device. In addition, the detection run-time has

not been reported.

The authors in Rashid et al. (2022) propose a stacking en-

semble technique (SET) with SelectKBest feature selection

technique for network anomaly detection. First, dimension-

ality reduction and features selection are applied to segre-

gate relevant features. Next, an ensemble of Decision Trees,

Random Forest, and XGBoost machine learning models are

employed to detect anomalies. However, the use of the SelectKBest technique is less adaptive to new malicious network traffic over time.

3. Materials and Methods

In this section, we ﬁrst provide background on the re-

lated topics, then we present an overview of our approach.

Next, we provide details on the proposed methodologies for

feature selection and anomaly detection. Finally, we present

our approach to generate the IoT-Zeek dataset.

3.1. Background

In the following, we provide an overview of particle

swarm optimization (PSO) and ensemble methods.

3.1.1. Particle Swarm Optimization

Particle swarm optimization (PSO) (Kennedy and Eber-

hart (1995)) is a stochastic and meta-heuristic optimization

algorithm, which was ﬁrst inspired by the social behaviour

of some animals (e.g., birds and ﬁshes). In the PSO algo-

rithm, the population of individuals is referred to as swarm,

and each individual within the swarm is referred to as a particle. These particles try to find the set of optimal solutions to a given problem by constantly updating their positions according to their own performance, which is called the cognitive aspect, and the current overall performance of the swarm, which is called the social aspect. Thus, PSO is based on two essential principles: cooperation/collaboration and competition, where the former represents the ability of one particle to communicate with other particles in order to coordinate their efforts towards the optimal solutions, whilst the latter represents one particle's desire to use its own performance and move towards the possible solution.

Moreover, each particle is deﬁned within a search space,

which represents the set of hyper-parameters to be optimized

for the solution. Depending on the swarm’s global solution,

each particle computes the cognitive aspect and the social

aspect according to Equation 1 and Equation 2, respectively,

as follows:

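$$\mathit{cognitive} = c_1 \times r_1 \times (\mathit{pos\_best}_i - \mathit{pos}_i) \tag{1}$$

$$\mathit{social} = c_2 \times r_2 \times (\mathit{pos\_global} - \mathit{pos}_i) \tag{2}$$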

where $c_1$ and $c_2$ are called acceleration constants and define the speed at which a particle should move towards the optimal solutions ($c_1$ defines the speed at which the particle should converge to its local solution, whilst $c_2$ defines the speed of convergence of the whole swarm towards the global solution), $r_1$ and $r_2$ are two randomly generated values to control the stochastic influence of the cognitive and social components on the overall velocity of a particle, $\mathit{pos}_i$ represents the position of a particle at iteration $i$, $\mathit{pos\_best}_i$ represents the local optimal solution found by that particle


so far, and $\mathit{pos\_global}$ represents the position of the global

solution found by the entire swarm so far.

Afterwards, each particle updates its velocity using Equation 3, where $t$ represents the current particle, $v_i(t)$ represents the velocity of the current particle at iteration $i$, and $w$ is the inertia weight (importance) given to that velocity (the smaller the weight $w$, the stronger the convergence towards the global optimum). Finally, the position of each particle is updated using Equation 4.

$$v_i(t+1) = w \times v_i(t) + \mathit{social} + \mathit{cognitive} \tag{3}$$

$$p_i(t+1) = p_i(t) + v_i(t) \tag{4}$$
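Equations 1 to 4 can be read as a single update step per particle. The Python sketch below is a minimal illustration under stated assumptions: positions and velocities are plain lists of floats, the random factors r1 and r2 are drawn uniformly from [0, 1) (the text only states that they are randomly generated), and the position update applies the freshly updated velocity, as in the standard PSO formulation, whereas Equation 4 as written uses the previous velocity. The function name `pso_step` and the numeric values are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def pso_step(
    pos: List[float],
    vel: List[float],
    pos_best: List[float],
    pos_global: List[float],
    c1: float,
    c2: float,
    w: float,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float]]:
    """Apply one velocity and position update to a single particle."""
    rng = rng or random.Random()
    r1, r2 = rng.random(), rng.random()  # assumed uniform in [0, 1)
    new_vel = []
    for x, v, pb, gb in zip(pos, vel, pos_best, pos_global):
        cognitive = c1 * r1 * (pb - x)  # Equation 1
        social = c2 * r2 * (gb - x)     # Equation 2
        new_vel.append(w * v + social + cognitive)  # Equation 3
    # Position update (Equation 4), here using the updated velocity.
    new_pos = [x + v for x, v in zip(pos, new_vel)]
    return new_pos, new_vel


new_pos, new_vel = pso_step(
    pos=[0.0, 10.0], vel=[0.0, 0.0],
    pos_best=[1.0, 8.0], pos_global=[2.0, 6.0],
    c1=1.5, c2=1.5, w=0.7, rng=random.Random(0),
)
```

Starting from zero velocity, the particle is pulled towards both its own best position and the swarm's best position, so each coordinate moves in the direction of those two points.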

3.1.2. Ensemble Methods

Ensemble methods are a type of machine learning model that combines multiple classifiers/predictors in order to improve the performance of the overall classification/prediction. In other words, ensemble methods combine the decisions made by multiple models using techniques such as majority voting, average, and weighted average. Moreover, these techniques provide easier interpretation of features and better predictive performance with less overfitting compared to other machine learning techniques. This family of machine learning techniques is generally classified into two major types (Bühlmann (2012)), which are presented in Figure 1, as follows:

1. bagging, where all the used predictors run in parallel and independently; these models are then combined using an aggregation technique to make the final decision. An example of this type is random forests, where a sample called a bootstrap is selected randomly and fed to one model. Therefore, each model in the forest will have a different observation, thus leading to no correlation between these predictors and making them less prone to overfitting.

2. boosting, which deploys the paradigm of learning from the errors/mistakes (called residuals) of each predictor's predecessor. Therefore, these types of ensemble methods are executed in a sequential order, which gives them an advantage over the first type in the form of shorter training time delays. An example of such ensemble techniques is gradient boosting.
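The aggregation techniques mentioned above (majority voting, average, and weighted average) and the bootstrap sampling used by bagging can be sketched in a few lines of Python. This is a simplified illustration: the function names, the toy predictions, and the weights are assumptions, and real ensemble libraries implement these steps with many more details.

```python
import random
from collections import Counter
from typing import List, Sequence


def majority_vote(labels: Sequence[int]) -> int:
    """Return the most frequent class label among the predictors' votes."""
    return Counter(labels).most_common(1)[0][0]


def weighted_average(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-predictor scores using non-negative weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


def bootstrap_sample(data: List, rng: random.Random) -> List:
    """Draw len(data) items with replacement, as done for each bagging predictor."""
    return [rng.choice(data) for _ in data]


votes = [1, 0, 1]         # class labels from three hypothetical predictors
scores = [0.9, 0.4, 0.7]  # e.g., predicted probabilities of class 1
vote_result = majority_vote(votes)
mean_score = weighted_average(scores, [1, 1, 1])      # plain average
weighted_score = weighted_average(scores, [3, 1, 1])  # first predictor trusted more
sample = bootstrap_sample(list(range(10)), random.Random(42))
```

Using equal weights reduces the weighted average to the plain average, which is why a single helper covers both cases here.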

3.2. Approach Overview

In order to identify anomalous connections, we propose

a deep learning-based autoencoder anomaly detection model. The

input to this model is a set of features obtained from the net-

work traﬃc connection logs. We propose a hybrid model

consisting of PSO algorithm and ensemble methods to iden-

tify the most relevant set of features for any given dataset.

During this process, we explore two types of ﬁtness func-

tions (models); the ﬁrst one belongs to the bagging ensem-

ble methods family (Random Forests), and the second one

belongs to the boosting method (gradient boosting).

Figure 1: Bagging vs. Boosting Ensemble Methods. (Diagram with a bagging panel and a boosting panel, each showing Bootstraps 1 to 3 and Predictors 1 to 3.)

Figure 2: Approach Overview. (Diagram: Input Dataset; Optimized Feature Selection with PSO, comprising 1. Search Space Definition, 2. Fitness and Objective Functions Definition, 3. Algorithm Initialization, 4. Iterative Search, and 5. Optimal Solutions Selection; Selected Features; 6. Deep Learning Anomaly Detection.)

An overview of our approach is represented in Figure 2.

Feature selection can be viewed as five sequential steps: search-space definition, fitness and objective function definition, algorithm initialization, iterative search, and optimal solutions selection. The proposed approach starts by defining the search space (Step 1) for PSO (Kennedy and Eberhart (1995); Ali and Malebary (2020)) depending on the chosen fitness function (Step 2). Afterwards, it takes as input any labeled dataset and initializes a fixed-size population/swarm by generating random particles (Step 3). Given a precise number of iterations, each particle will then try to find the optimal position of the solution by constantly updating its position within the search space according to the performance of the swarm and its own performance (Step 4). The goal of the swarm is to find the optimal model (optimal hyper-parameters) which maximizes certain performance metrics (Step 5). Finally, we consider only the set of best fitting settings (e.g., hyper-parameters), which give us the highest accuracy metrics. We then use these settings to build the final model(s) in order to extract

the most relevant features. Afterwards, we leverage the se-


lected features discovered during the optimization part and

engineer an anomaly detection model using deep learning

autoencoders (Step 6) (Lauzon (2012)). Our goal is to obtain an anomaly detection model which outperforms existing models in terms of the f1 score metric.

3.3. Methodology

In this section, we provide more details on the proposed

methodology.

3.3.1. Optimized Feature Selection

Our feature selection algorithm can be performed as ﬁve

sequential steps. The algorithmic description of the opti-

mized feature selection is presented in Algorithm 1 and Al-

gorithm 2. In what follows, we explain each step in detail.

Algorithm 1 Feature Selection: Algorithmic Description

Input: D ⊳ Input dataset

Output: optimal_solutions

global variables
    c1, c2 ⊳ cognitive and social acceleration constants, respectively
    r1, r2 ⊳ cognitive and social random factors, respectively
    w ⊳ velocity's weight
    swarm_size
    maxIterations
    global_fitness ⊳ global solution fitness value
    global_position ⊳ global solution's position
    global_solutions_list ⊳ optimal solutions positions and fitness values
end global variables
fitness_function ← DEFINE_FITNESS
bounds ← DEF_SEARCH_SPACE(fitness_function)
(swarm, swarm_size, maxIterations, c1, c2, r1, r2, w) ← ALGORITHM_INIT
for each i ∈ maxIterations do ⊳ Iterative search
    for each particle_k ∈ swarm do
        particle_k_fitness ← EVALUATE_FITNESS(fitness_function, particle_k)
        if particle_k_fitness > global_fitness then
            global_fitness ← particle_k_fitness
            global_position ← particle_k_position
            global_solutions_list += global_position
            personal_best_position ← particle_k_position
        end if
        UPDATE_VELOCITY(c1, c2, r1, r2, w, global_position, particle_k_position, personal_best_position, particle_k_velocity)
        UPDATE_POSITION(particle_k_velocity, particle_k+1_position)
    end for
end for
optimal_solutions ← max(f1_score, global_solutions_list) ⊳ Fitness and objective function definition, optimal solutions selection
return optimal_solutions

Fitness and objective functions: First, we define the fitness function used to evaluate the performance of each particle within the swarm. We choose ensemble method classifiers due to their advantages, which include low overfitting and high accuracy. Each particle is fed to the classifier, which in turn automatically adapts to it and is trained on the dataset accordingly. At the end of this process, the fitness function returns the following evaluation metrics: accuracy, recall, precision, and f1 score.

Next, we define the objective function to be satisfied by the set of possible optimal solutions (i.e., to evaluate the solutions as a whole). The objective function helps filter a set of results and keep only those that satisfy our needs. In this work, since we integrate ensemble models as evaluation/fitness functions, we should select a metric which best describes the performance of these models at each particle level. From the above-mentioned evaluation metrics, we choose the last one (f1 score), since it is the harmonic mean of the precision and recall and therefore takes both false positives and false negatives into account, whereas the accuracy only measures the fraction of correctly classified samples (true positives and true negatives). Moreover, the f1 metric is better suited to uneven or unbalanced datasets, where the target classes are not balanced. In this case, our objective is to consider only the settings of the models which give us the highest values of f1 score. Thus, we define the objective function to be the maximization of these values. This process also helps prevent our algorithm from getting stuck in a local optimum and helps it converge to a global one.

Algorithm 2 Feature selection: Functions Definitions

 1: procedure DEFINE_FITNESS ⊳ Fitness and objective function definition
 2:     fitness_function ← ensemble_classifier ⊳ tuned between random forests and gradient boosting
 3:     return fitness_function
 4: end procedure
 5: procedure DEF_SEARCH_SPACE(fitness_function) ⊳ Search-space definition
 6:     if fitness_function == random_forest then
 7:         return bounds ← [(0.1, 0.4), (50, 1000)]
 8:     else if fitness_function == gradient_boosting then
 9:         return bounds ← [(0.1, 0.4), (50, 1000), (0.1, 0.3)]
10:     end if
11: end procedure
12: procedure ALGORITHM_INIT ⊳ Algorithm initialization
13:     c1 ← [1, 2] ⊳ c1 is tuned using two different values: 1 and 2
14:     c2 ← 2
15:     w ← 0.5
16:     r1 ← random, r2 ← random
17:     swarm_size ← 15
18:     swarm ← random(particles, swarm_size)
19:     maxIterations ← 30
20:     return swarm, swarm_size, maxIterations, c1, c2, r1, r2, w
21: end procedure
22: procedure EVALUATE_FITNESS(fitness_function, particle)
23:     fitness_value ← f1_score(fitness_function, particle)
24:     return fitness_value
25: end procedure
26: procedure UPDATE_VELOCITY(c1, c2, r1, r2, w, global_position, particle_k_position, personal_best_position, particle_k_velocity)
27:     cognitive = c1 ∗ r1 ∗ (personal_best_position − particle_k_position)
28:     social = c2 ∗ r2 ∗ (global_position − particle_k_position)
29:     particle_k+1_velocity = w ∗ particle_k_velocity + social + cognitive
30:     return particle_k+1_velocity
31: end procedure
32: procedure UPDATE_POSITION(particle_k_velocity, particle_k_position)
33:     update particle position:
34:     particle_k+1_position = particle_k_velocity + particle_k_position
35:     return particle_k+1_position
36: end procedure

Search space: Since we are using ensemble methods as the particles' evaluation/fitness function, we should choose the appropriate set of hyper-parameters to be fed to the models. Depending on the type of the model, i.e., bagging vs. boosting (Bühlmann (2012)), we select the hyper-parameters that are most important to the learning process of the learner/model (e.g., number of trees/estimators, respective sizes of the training and testing splits, etc.). This set of hyper-parameters defines the dimensional space used by our algorithm to search for possible optimal solutions.

For bagging techniques, there are two major types of

hyper-parameters that need to be investigated and optimized,

namely, test data size and number of estimators (trees). The

ﬁrst one deﬁnes the size of testing data on which the model

should be tested, and consequently the size of training data

will be deduced automatically. It is generally recommended

to set testing data size smaller than that of training set (be-

tween 10% and 40%), thus we set the lower bound as 10%

and upper bound as 40%. This hyper-parameter will be de-

ﬁned as the ﬁrst dimension of each particle and based on

it, the ﬁtness function will decide on how to split the input

dataset and train the appropriate ensemble model.

The second hyper-parameter that needs to be optimized

is the number of estimators, which represents the number of

decision trees that are part of the ensemble learning model.

Normally, the larger the number of trees, the better the overall ensemble model performs. However, this improvement is limited: beyond some point the gains stop and the performance may start to degrade, resulting in badly predicted samples and even overfitting. In addition, the larger the number of trees, the higher the computational cost, making the experimentation more difficult for large-scale datasets. In

general practice, this hyper-parameter is decided with ex-

haustive experimentation by initiating the number of trees

with the smallest value, and at each iteration increasing it

slightly to improve the model’s performance compared to

the previous experimentation results. However, ﬁnding the

optimal number of estimators is very time consuming, es-

pecially in the case of large datasets which leads to days or

even months of experimentation. Moreover, this approach

does not exhaustively explore all the possible optimal val-

ues for the hyper-parameters; it is mainly performed based

on the knowledge and experiences of the experts. There-

fore, we propose to integrate this parameter within the opti-

mization algorithm as a hyper-parameter to be optimized for

the global solution. To improve the scalability of the opti-

mization algorithm, we choose this hyper-parameter to have

a value between 50 (lower bound) and 1000 (upper bound).

In boosting methods, new trees are added to the model in order to correct the samples mispredicted (residuals) by the previous tree. This process has two effects. The first, which can be considered a benefit, is faster training compared to bagging techniques. The second, which can be considered a disadvantage, is that the model becomes more prone to overfitting. To overcome this problem, a learning rate is introduced; it can be seen as a weight (percentage) which controls and reduces the amount of correction made by the current tree (i.e., predictor/classifier) to the previous one. As a result, the overall performance of the model tends to improve when the learning rate is smaller and the number of trees is higher. Therefore, in addition to the above-mentioned hyper-parameters, the boosting ensemble methods require a third essential hyper-parameter, the learning rate, to be optimized. In general practice, it is recommended to set this parameter to a value between 0.1 (lower bound) and 0.3 (upper bound).

Thus, our optimization algorithm’s search space for the

bagging methods is deﬁned as a 2-dimensional space, where

the ﬁrst dimension represents the test size, and the second

one is the number of decision trees/estimators included in the

ensemble model. For the boosting methods, our search space

is deﬁned as a 3-dimensional space, where each particle has

three parameters: (test size, number of trees, learning rate).
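As a minimal sketch, the search-space definition described above (DEF_SEARCH_SPACE in Algorithm 2) could be written in Python as follows. The bound values are taken from the text; the function name, the string labels for the two model families, and the constant names are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Tuple

Bounds = List[Tuple[float, float]]

# Bounds quoted in the text; the constant names are illustrative.
TEST_SIZE_BOUNDS = (0.1, 0.4)      # fraction of the data held out for testing
N_TREES_BOUNDS = (50, 1000)        # number of estimators (trees)
LEARNING_RATE_BOUNDS = (0.1, 0.3)  # boosting methods only


def def_search_space(fitness_function: str) -> Bounds:
    """Return one (lower, upper) pair per particle dimension."""
    if fitness_function == "random_forest":       # bagging: 2-D space
        return [TEST_SIZE_BOUNDS, N_TREES_BOUNDS]
    if fitness_function == "gradient_boosting":   # boosting: 3-D space
        return [TEST_SIZE_BOUNDS, N_TREES_BOUNDS, LEARNING_RATE_BOUNDS]
    raise ValueError(f"unknown fitness function: {fitness_function}")
```

Each particle is then simply a point inside the returned box, e.g. (test size, number of trees) for bagging.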

Algorithm initialization: We start by initializing the settings for our PSO algorithm. First, we define the maximum number of iterations, which can be viewed as the number of chances given to the swarm to find the optimal solutions. This parameter is essential since in optimization problems we only know that the problem to be solved might have multiple optimal solutions; we do not know the exact number of these optima. If the number of chances is too large, the algorithm can take a tremendous amount of time. On the other hand, the ability of the optimization algorithm to find more optimal solutions improves as the number of iterations increases. To limit the time consumption, we fix this setting to 30 iterations.

Moreover, we need to define the values for the acceleration constants (c1 and c2) and the weight (w) (Kennedy and Eberhart (1995)). For the former, it is recommended to set them such that their product is between 0 and 4 (0 ≤ c1 × c2 ≤ 4) (Marini and Walczak (2015)). We run the algorithm two times; the first time we set these two constants to equal values (both set to 2), whilst in the second execution, we give more importance (speed) to the global solution by setting c2 to 2 and c1 to 1. The intuition is to start by giving the same importance to the local and global solutions, then increase the relative importance of the global solution (c2) and check which setting allows us to explore better optimal solutions.

Next, we initialize the swarm (population of particles) by first defining a fixed number of particles (individuals) (swarm_size), which constitute the swarm. For each particle, we randomly generate its velocity and position such that they are selected within our pre-defined bounds (search space definition). We initialize the global fitness value (f1) of the overall swarm to 0.5.
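The initialization step can be illustrated with a short Python sketch. The swarm size of 15 and the initial global fitness of 0.5 come from the text; the dictionary layout of a particle, the function name `init_swarm`, and the range from which velocities are drawn are assumptions made for illustration, since the exact sampling scheme is not specified.

```python
import random


def init_swarm(bounds, swarm_size=15, seed=None):
    """Create `swarm_size` particles with random positions and velocities."""
    rng = random.Random(seed)
    swarm = []
    for _ in range(swarm_size):
        # Positions are sampled uniformly inside the search-space bounds.
        position = [rng.uniform(lo, hi) for lo, hi in bounds]
        # Assumption: each velocity component is drawn from
        # [-(hi - lo), hi - lo], i.e. scaled to the width of its dimension.
        velocity = [rng.uniform(lo - hi, hi - lo) for lo, hi in bounds]
        swarm.append({
            "position": position,
            "velocity": velocity,
            "best_position": list(position),  # personal best starts here
        })
    return swarm


GLOBAL_FITNESS_INIT = 0.5  # initial global f1 value stated in the text
swarm = init_swarm([(0.1, 0.4), (50, 1000)], swarm_size=15, seed=0)
```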

Iterative search: During the iterative process, each particle in the swarm is evaluated by the fitness function (ensemble model classifier) using its own coordinates. If, after the particle's fitness evaluation, the f1 score of that particle is found to be greater than the global (swarm's) f1 score, then the particle first updates the global f1 score to its own value and sets the global solution's position to its own position. Next, the particle updates its velocity and position using the corresponding functions presented in Algorithm 2 (line 29 and line 34, respectively).
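The two update rules of Algorithm 2 (lines 27 to 29 and line 34) can be sketched in Python as below. The argument order mirrors the pseudocode; treating positions and velocities as plain lists of floats is an assumption, and in this sketch the caller is expected to pass the freshly computed velocity to `update_position`, as is common in PSO implementations.

```python
def update_velocity(c1, c2, r1, r2, w, global_position, position,
                    personal_best, velocity):
    """Inertia term plus social and cognitive pulls, per dimension."""
    new_velocity = []
    for v, g, x, p in zip(velocity, global_position, position, personal_best):
        cognitive = c1 * r1 * (p - x)   # pull toward the particle's own best
        social = c2 * r2 * (g - x)      # pull toward the swarm's best
        new_velocity.append(w * v + social + cognitive)
    return new_velocity


def update_position(velocity, position):
    """Move the particle by its velocity, per dimension."""
    return [x + v for x, v in zip(position, velocity)]
```

With c1 = 1 and c2 = 2 (the second configuration in the text), the social term is weighted twice as heavily as the cognitive term for equal random factors.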

As for the time complexity of this process, it is of the order of O(nm), where n represents the maximum number of iterations (line 14 in Algorithm 1) and m represents the population/swarm size (line 15 in Algorithm 1). However, since in our experiments n and m have small values (the maximum number of iterations is equal to 30 and the size of the swarm is equal to 15), our approach does not encounter high time complexity issues. As for the fitness function (line 16 in Algorithm 1), the (training) time complexity of the models (i.e., Random Forests or XGBoost) is of the order of O(k · v · n · log(n)), where k is the number of trees, v is the number of features, and n here denotes the number of records/rows (not the number of iterations). Due to the presence of a bottleneck in our algorithm for evaluating the fitness of each particle (either by training Random

Forests or XGBoost models), we leverage the multiprocessing paradigm by taking advantage of 20 CPU cores of our setup environment. On the other hand, since we use a server with 128 GB of RAM (presented in subsection 4.1), the space complexity does not constitute a bottleneck in our algorithm. Therefore, our approach is sufficiently efficient and scalable on the studied datasets and their respective models. The experiments reported in subsection 4.6 confirm the scalability and efficiency of our proposed approach.

It is worth noting that a particle can end up with a non-integer value in its second dimension (number of trees). This is problematic because the number of decision trees making up the ensemble model must be an integer. In that case, we round the value to the closest integer. Furthermore, if at any iteration a particle's new position falls outside the search-space bounds (e.g., [0.1, 0.4] for the first dimension and [50, 1000] for the second one), we correct the out-of-bound value to the closest bound. For instance, if a particle's new position is (0.5, 1200), we correct it to (0.4, 1000).
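The rounding and bound-correction rules can be sketched as a small repair function. The function name and the `integer_dims` parameter are hypothetical; the bounds and the worked example are the ones given in the text. Note that Python's built-in `round` resolves exact ties (e.g., 2.5) to the nearest even integer, which is one possible reading of "closest integer".

```python
def repair_position(position, bounds, integer_dims=(1,)):
    """Round integer-valued dimensions, then clip every dimension to its bounds."""
    repaired = []
    for i, (value, (lo, hi)) in enumerate(zip(position, bounds)):
        if i in integer_dims:
            value = round(value)            # number of trees must be an integer
        value = min(max(value, lo), hi)     # snap to the closest bound
        repaired.append(value)
    return repaired


bounds = [(0.1, 0.4), (50, 1000)]
print(repair_position([0.5, 1200], bounds))  # [0.4, 1000]
```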

Optimal solutions: Finally, after the maximum number of iterations is reached, we apply a maximization function, which takes as input all the possible solutions explored during the iterative search and returns only the ones with the highest fitness value (f1 score). If more than one optimal solution is found (giving the same f1 score value), we select the one whose set of hyper-parameters induces the best efficiency (e.g., execution time and CPU usage). Then, the appropriate model is trained using the selected optimum's hyper-parameters, and only those features whose importance values are higher than the average importance of all features are selected for the next phase (i.e., anomaly detection).
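The above-average importance filter can be illustrated as follows. The feature names and importance values are made up for the example; in practice the importances would come from the trained ensemble model (for instance, the `feature_importances_` attribute exposed by scikit-learn and XGBoost classifiers).

```python
def select_features(importances):
    """Keep the features whose importance exceeds the mean importance."""
    mean_importance = sum(importances.values()) / len(importances)
    return [name for name, value in importances.items()
            if value > mean_importance]


# Hypothetical importance values, for illustration only.
example = {"duration": 0.05, "orig_pkts": 0.40, "resp_pkts": 0.35, "proto": 0.20}
print(select_features(example))  # ['orig_pkts', 'resp_pkts']
```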

3.3.2. Deep Learning-Based Anomaly Detection

After selecting the optimal feature selection model and using it to select the most relevant features, we use the dataset filtered down to the selected features to generate an efficient anomaly detection model using autoencoders. To this end, we start from the most accurate models for that dataset among the existing state-of-the-art models. We aim at reducing our search for the appropriate model by using the most efficient proposed one as a starting point. Then, we feed the model with only the selected features of the dataset, which helps reduce the autoencoder model's training time: since we do not use all the features, the compression and bottleneck generation (encoder) as well as the reconstruction phase (decoder) run faster compared to the case of feeding all the features as input.

Listing 1: Malicious and Benign Traffic Logs Sources

----------------- Malicious Traffic Logs Sources -----------------
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-370-1/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-371-1/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-372-1/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-373-1/bro/conn.log
----------------- Benign Traffic Logs Sources -----------------
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-25/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-26/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-27/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-28/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-29/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-30/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-31/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-32/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-33/bro/conn.log

There are multiple hyper-parameters that we need to tune in order to find the optimal autoencoder model: batch size, loss function, number of layers, number of neurons, and regularizations. Once we reach a point where our model outperforms the state-of-the-art models (e.g., Yang et al. (2019)), we stop the search and select that model as the optimal one. We then use the l1 norm to compute the distance between the input and the reconstructed data. This result is then compared to the input labels (ground truth), and different threshold values are tuned to select the one that gives the highest accuracy metrics on each dataset.
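The reconstruction-error thresholding described in Section 3.3.2 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the reconstructions are taken as given (no autoencoder is trained here), the candidate thresholds are supplied by the caller, and the f1 score is used as the selection criterion, whereas the text only states that the threshold giving the highest accuracy metrics is kept.

```python
def l1_error(x, x_hat):
    """l1 distance between an input vector and its reconstruction."""
    return sum(abs(a - b) for a, b in zip(x, x_hat))


def f1_at_threshold(errors, labels, threshold):
    """f1 score when samples with error > threshold are flagged as anomalies (1)."""
    preds = [1 if e > threshold else 0 for e in errors]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn            # f1 = 2TP / (2TP + FP + FN)
    return 2 * tp / denom if denom else 0.0


def best_threshold(errors, labels, candidates):
    """Return the candidate threshold with the highest f1 score."""
    return max(candidates, key=lambda t: f1_at_threshold(errors, labels, t))
```

For example, with errors `[0.1, 0.2, 0.9, 1.1]`, labels `[0, 0, 1, 1]`, and candidates `[0.05, 0.5, 1.0]`, the threshold 0.5 separates the two classes perfectly and is the one returned.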

3.4. IoT-Zeek Dataset Generation

In the following, we describe the methodology used to generate the IoT-Zeek dataset of malicious and benign network traffic. We first deploy a real environment which consists of various Raspberry Pi devices that communicate with each other. We install Zeek sensors to monitor the network traffic and extract the connection logs (conn.log) generated by the Zeek NIDS (Paxson (1999); Team (2018)). Then, we inject different malware samples into these devices and capture the malicious network traffic. These connection logs are then classified using classical machine learning and deep learning models into malicious or benign connections (as explained later). A portion of the dataset is sampled, which contains 150,000 records (connections), comprising 129,441 malicious connections and 20,559 benign connections.

Malware and Benign Sources: To ensure the freshness of our dataset regarding the malicious/benign IP addresses, we collect PCAP files from both the Concordia SecLab malware feed (in-house source) and the Stratosphere Research Laboratory (Laboratory (2018)). Then, we build the global training dataset from the labeled Zeek logs. The malicious traffic logs and the benign traffic logs are retrieved from the sources presented in Listing 1. The numbers of malicious and benign connections in the evaluation dataset are 1,764,604 and 278,998, respectively.

3.4.1. Maliciousness Classiﬁcation

In this section, we present the ensemble models employed to classify malicious activities in the IoT-Zeek dataset. As depicted in Figure 3, the first set of models belongs to classical machine learning, while the second one belongs to deep learning. The input to the models consists of all connection log features presented in Table 1 (there exist some other features, which are not extracted from PCAP files by Zeek in the default setting).⁵

Figure 3: Maliciousness Detection Pipeline.

Table 1
Zeek's Connection Log File Features Description.

Feature | Type | Description
ts | Numerical | Unix timestamp of the connection's occurrence date.
id.orig_h | Categorical | Originator's IP address.
id.orig_p | Categorical | Originator's TCP/UDP port.
id.resp_h | Categorical | Responder's IP address.
id.resp_p | Categorical | Responder's TCP/UDP port.
proto | Categorical | The transport layer protocol (tcp, udp, or icmp).
service | Categorical | The requested application layer protocol (e.g., ssh, dns, etc.).
orig_ip_bytes | Numerical | Number of bytes sent from the originator to the responder; this is extracted from the packet header.
resp_ip_bytes | Numerical | Number of bytes sent from the responder to the originator.
orig_pkts | Numerical | Number of packets sent from the originator to the responder.
resp_pkts | Numerical | Number of packets sent from the responder to the originator.
conn_state | Categorical | A string giving an overview description of the state of the connection.
history | Categorical | A string giving more details about the state of the connection.

As for the classical machine learning classification, we employ RandomForest, XGBoost, LightGBM, and CatBoost classifiers. We choose these classifiers due to their high performance and reputation in the industry. Moreover, the chosen classifiers were part of many winning solutions in machine learning competitions.⁶ In addition to the classical machine learning models, we deploy two deep learning models for maliciousness detection: a convolutional neural network (CNN) and a feed forward neural network (NN). More specifically, we customize the architecture of the CNN model to learn the maliciousness of the network traffic, as shown in Figure 4. The details of the model are presented in Table 2. Parameters such as Filters are obtained experimentally as a trade-off between the size and the accuracy of the model, whereas Kernel and Stride take values that are standard in the CNN literature. The feed forward neural network architecture is a typical neural network with fully connected layers. The details of the model are listed in Table 3.

Ensemble Models: Training the aforementioned machine learning classifiers on the connection logs features produces a set of models M = {cM_1, cM_2, cM_3, cM_4, dM_1, dM_2},

⁵ https://docs.zeek.org/en/lts/scripts/base/protocols/conn/main.zeek.html
⁶ https://www.kaggle.com/competitions

Table 2
Dimension Convolutional Neural Network Model Details.

Block | # | Layers | Options
Block1 | 1 | Conv | Filter=64, Kernel=(3,1), Stride=(1,1), ZeroPadding, Activation=ReLU
Block1 | 2 | BNorm | BatchNormalization
Block1 | 3 | MaxPooling | Kernel=(2,2), Stride=(2,2), Zero-Padding
Block2 | 4 | MaxPooling | Global Max Pooling
Block2 | 5 | Fully Connected | #Output=512, Activation=ReLU
Block2 | 6 | Fully Connected | #Output=1, Activation=Sigmoid

Table 3
Feed Forward Neural Network Model Details.

# | Layers | Options
1 | Fully Connected | #Output=128, Activation=ReLU
2 | Batch Normalization | Batch Normalization
3 | Fully Connected | #Output=256, Activation=ReLU
4 | Batch Normalization | Batch Normalization
5 | Fully Connected | #Output=512, Activation=ReLU
6 | Batch Normalization | Batch Normalization
7 | Fully Connected | #Output=512, Activation=ReLU
8 | Fully Connected | #Output=1, Activation=Sigmoid

where cM_i represents a classical machine learning model/learner (i.e., RandomForest, XGBoost, LightGBM, and CatBoost classifiers) and dM_i represents a deep learning model/learner (i.e., CNN and NN). To perform ensemble learning, we use the ensemble averaging technique as presented

Figure 4: CNN Maliciousness Detection Model’s Architecture.

in Equation 5, as follows:

$$Y'(X, \alpha) = \sum_{i=1}^{|M|} \alpha_i \times y_i(X) \tag{5}$$

where Y' is the ensemble probability likelihood, X is the input feature vector, α_i are the weights, and y_i(X) is the probability likelihood produced by each single model. Each individual model/learner (classical machine learning and deep learning models as presented in Figure 3) produces a single probability (y_i), which represents the maliciousness likelihood. The detection probabilities of these models are input to the weighted average ensemble. This technique employs a weighted average using the α_i weights to produce the ensemble prediction. In the current setting, we choose α_i = 1 for all the models, which indicates that all the models contribute equally to the final decision.
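A small Python sketch of the weighted combination in Equation 5 is given below. The per-model probabilities in the usage line are made-up values. The `normalize` flag is an assumption added for illustration: with it enabled, the weighted sum is divided by the sum of the weights so that the result stays in [0, 1] and reads as an average; with it disabled, the function returns the plain weighted sum as written in Equation 5.

```python
def ensemble_score(probabilities, weights=None, normalize=True):
    """Combine per-model maliciousness probabilities y_i with weights alpha_i."""
    if weights is None:
        weights = [1.0] * len(probabilities)   # alpha_i = 1 for every model
    weighted_sum = sum(a * y for a, y in zip(weights, probabilities))
    if normalize:
        return weighted_sum / sum(weights)     # assumption: average, not raw sum
    return weighted_sum


# Hypothetical outputs of the six models (four classical, two deep learning).
score = ensemble_score([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
```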

3.4.2. System Adaptation

Adaptation to new network threats and attacks is an important criterion in malicious network traffic detection. In our context, we provide this capability through the automation of the model generation process. As shown in Figure 5, the system leverages a feed of malicious traffic (in the form of PCAP files) to build an updated training dataset every epoch. The updated training dataset is representative of the state-of-the-art network attacks and benign traffic. The system ensures the quality of the produced model by using validation and testing datasets, and only models that achieve high detection performance are deployed in production.

4. Evaluation Results and Discussion

In this section, we ﬁrst describe the experimental setup,

and the benchmark datasets. Then, we provide more details

on the validation of our proposed feature selection approach

on each of the chosen benchmark datasets. Next, we report

the accuracy of our anomaly detection model on diﬀerent

datasets and compare our results with the state-of-the-art ap-

proaches. Finally, we provide the results of our eﬃciency

study.

4.1. Experimental Setup

All our experiments are conducted on a computation server with an Intel Xeon E5-2630 2.30 GHz CPU with 24 cores and 128 GB of RAM, running CentOS Linux version 7. Our system prototype is developed in Python 3.6 and PyTorch, leveraging the scikit-learn (sklearn) library for bagging ensemble learning techniques (random forest classifier), and the xgboost library for boosting ensemble techniques (gradient boosting classifier) and other machine learning models. Multiprocessing is deployed for fast model training by taking advantage of 20 cores of the CPU for both Random Forest and XGBoost classifiers. We use the pandas API to load and preprocess each dataset. We adapt the autoencoder models by utilizing the keras API in conjunction with tensorflow.

Evaluation Metrics. To evaluate the performance of our approach, we use the accuracy, precision, recall, and F1 score metrics, which are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
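The four metrics can be computed directly from the confusion-matrix counts, as in the following sketch. The function name and the example counts are illustrative; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```

For these counts, accuracy is 170/200 = 0.85, precision is 90/100 = 0.9, recall is 90/110, and F1 is 180/210.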

4.2. Benchmark Datasets Description

In this subsection, we introduce the two benchmark

datasets as follows.

NSL-KDD Dataset: A network dataset, called NSL-KDD

(Tavallaee et al. (2009)), was proposed to ﬁx two main issues

(e.g., redundant records and level of diﬃculty) related to its

predecessor KDD’99 dataset. The updated dataset (NSL-

KDD) contains a total of 148,517 network ﬂow records, with

77,054 being labeled as normal records and 71,463 being

Figure 5: Machine Learning Models Development.

labeled as attacks. The dataset consists of a total of 41 fea-

tures, 32 of which are numerical (integer or ﬂoat) type and 9

features have categorical values.

UNSW-NB15 Dataset: The UNSW-NB15 dataset (Moustafa and Slay (2015); Moustafa and Slay (2016a); Moustafa et al. (2019); Moustafa et al. (2017)) was created in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) using the IXIA PerfectStorm framework, and contains real normal and attack behaviours. Tcpdump is then used to capture 100 GB of network traffic activity. The dataset consists of nine types of cyber attacks: Fuzzers, Analysis, Backdoors, Denial of Service (DoS), Exploits, Generic, Reconnaissance, Shellcode, and Worms. Moreover, the generated dataset contains a total of 49 features, 42 of which are of numerical (integer or float) type, and 6 are of categorical type. This dataset contains 2,218,761 normal and 321,283 attack records.

4.3. Feature Selection Results

We apply our proposed optimized feature selection solution on the three aforementioned datasets. The results for each of the two explored fitness functions, Random Forests (bagging) and XGBoost (boosting), are detailed in Table 4 and Table 5, respectively. We observe that the latter fitness function (XGBoost) achieves the highest fitness values (f1 score) on all three datasets. Moreover, when the c2 acceleration constant is higher than c1 (c2 = 2 and c1 = 1), the algorithm performs better in finding optimal solutions for two of the datasets, while for the NSL-KDD dataset both settings give the same values of f1 score.

Afterwards, for each set of hyper-parameters selected for each dataset, we train the appropriate model (XGBoost), and extract the list of features with their corresponding importance values. Then, we compute the average importance and select only the features whose importance is higher than the obtained average value. The results of this process are presented in Table 6.

Effects of Imbalanced Data: We further examine the effects of imbalanced data on our feature selection approach. As presented in subsection 3.4, the IoT-Zeek dataset has a smaller number of benign connections compared to malicious connections, which may lead machine learning algorithms to ignore the minority class. According to the literature (Fernández et al. (2018)), oversampling and undersampling techniques are recommended to overcome this issue. To this end, we leverage the SMOTE and RUS Python libraries⁷ and apply both oversampling and undersampling techniques on the IoT-Zeek data. According to the obtained results, the oversampling technique slightly outperforms the undersampling technique. Consequently, we consider the oversampled dataset during our experiments and refer to it as the IoT-Zeek-Oversampled dataset.

We apply our optimized feature selection solution on the IoT-Zeek-Oversampled dataset. The results of the two explored fitness functions are presented in Table 7. We observe that the XGBoost fitness function achieves the highest fitness values (f1 score). More specifically, when the c2 acceleration constant is equal to c1 (c2 = 2 and c1 = 2), the XGBoost algorithm performs best in finding optimal solutions. Afterwards, for each of the selected sets of hyper-parameters, we train the appropriate XGBoost model,

⁷ https://imbalanced-learn.org/stable/index.html

Table 4
Feature Selection Results using Random Forests Classifier as Fitness Function (Acceleration constant c2 is fixed to 2 whilst c1 is tuned, f1 score is the fitness function). Dataset and c1 are shown on the first row of each group only.

Dataset | c1 | Test size | #Trees | Accuracy | f1 score | Precision | Recall
NSL-KDD | 2 | 0.1 | 71 | 99.52 | 99.52 | 99.52 | 99.52
 |  | 0.1 | 70 | 99.52 | 99.52 | 99.52 | 99.52
 |  | 0.1 | 323 | 99.51 | 99.51 | 99.51 | 99.5
 |  | 0.103 | 295 | 99.51 | 99.51 | 99.51 | 99.51
 |  | 0.12 | 50 | 99.5 | 99.51 | 99.51 | 99.5
 |  | 0.1 | 107 | 99.52 | 99.51 | 99.52 | 99.51
 |  | 0.15 | 258 | 99.51 | 99.51 | 99.51 | 99.5
 |  | 0.2 | 50 | 99.52 | 99.51 | 99.52 | 99.51
 |  | 0.1 | 424 | 99.5 | 99.5 | 99.5 | 99.5
 | 1 | 0.1 | 50 | 99.549 | 99.549 | 99.549 | 99.54
 |  | 0.1 | 63 | 99.51 | 99.51 | 99.51 | 99.51
 |  | 0.1 | 51 | 99.51 | 99.51 | 99.51 | 99.51
 |  | 0.1002 | 153 | 99.51 | 99.51 | 99.51 | 99.51
 |  | 0.105 | 50 | 99.51 | 99.51 | 99.51 | 99.51
UNSW-NB15 | 2 | 0.1 | 50 | 99.93 | 99.49 | 99.28 | 99.69
 |  | 0.1 | 84 | 99.93 | 99.43 | 99.15 | 99.71
 |  | 0.103 | 50 | 99.93 | 99.42 | 99.17 | 99.68
 | 1 | 0.1 | 1000 | 99.92 | 99.4 | 99.1 | 99.70
 |  | 0.1005 | 1000 | 99.92 | 99.4 | 99.1 | 99.70
 |  | 0.106 | 1000 | 99.92 | 99.4 | 99.09 | 99.70
IoT-Zeek Dataset | 2 | 0.1 | 724 | 99.99 | 99.99 | 100 | 99.98
 |  | 0.111 | 106 | 99.99 | 99.99 | 100 | 99.98
 |  | 0.112 | 980 | 99.99 | 99.99 | 100 | 99.98
 | 1 | 0.214 | 80 | 99.99 | 99.99 | 100 | 99.98
 |  | 0.103 | 111 | 99.99 | 99.99 | 100 | 99.98
 |  | 0.295 | 50 | 99.99 | 99.99 | 100 | 99.98

Table 5
Feature Selection Results using XGBoost Classifier as Fitness Function (Acceleration constant c2 is fixed to 2 whilst c1 is tuned, f1 score is the fitness function). Dataset and c1 are shown on the first row of each group only.

Dataset | c1 | Test size | #Trees | Learning rate | Accuracy | f1 score | Precision | Recall
NSL-KDD | 2 | 0.102 | 376 | 0.162523 | 99.75 | 99.75 | 99.75 | 99.75
 |  | 0.13 | 327 | 0.138372 | 99.75 | 99.75 | 99.75 | 99.73
 |  | 0.104 | 292 | 0.16305 | 99.75 | 99.75 | 99.75 | 99.75
 |  | 0.1 | 233 | 0.1473 | 99.739 | 99.739 | 99.739 | 99.73
 |  | 0.10558 | 241 | 0.17077 | 99.73 | 99.73 | 99.7 | 99.73
 | 1 | 0.106 | 680 | 0.1 | 99.75 | 99.75 | 99.75 | 99.75
 |  | 0.105 | 681 | 0.1 | 99.75 | 99.75 | 99.75 | 99.73
 |  | 0.106 | 686 | 0.1 | 99.75 | 99.75 | 99.75 | 99.75
UNSW-NB15 | 2 | 0.1 | 903 | 0.1 | 99.97 | 99.76 | 99.71 | 99.82
 |  | 0.1 | 824 | 0.1 | 99.97 | 99.76 | 99.70 | 99.82
 |  | 0.1 | 827 | 0.1 | 99.97 | 99.76 | 99.71 | 99.82
 |  | 0.16 | 903 | 0.102 | 99.90 | 99.60 | 99.70 | 99.8
 |  | 0.1 | 899 | 0.10013 | 99.90 | 99.60 | 99.70 | 99.8
 | 1 | 0.1 | 1000 | 0.1 | 99.90 | 99.80 | 99.80 | 99.87
 |  | 0.1 | 1000 | 0.137 | 99.96 | 99.70 | 99.65 | 99.77
 |  | 0.158 | 1000 | 0.141 | 99.96 | 99.71 | 99.68 | 99.74
 |  | 0.17 | 816 | 0.12 | 99.96 | 99.69 | 99.67 | 99.71
 |  | 0.114 | 425 | 0.159 | 99.96 | 99.67 | 99.57 | 99.77
 |  | 0.137 | 481 | 0.125 | 99.96 | 99.67 | 99.61 | 99.73
IoT-Zeek Dataset | 2 | 0.4 | 1000 | 0.294 | 99.90 | 99.90 | 99.90 | 99.90
 |  | 0.4 | 1000 | 0.158 | 99.90 | 99.90 | 100 | 99.90
 |  | 0.325 | 730 | 0.3 | 99.90 | 99.90 | 100 | 99.90
 |  | 0.1 | 411 | 0.1 | 99.90 | 99.90 | 100 | 99.98
 |  | 0.369 | 510 | 0.3 | 99.99 | 99.99 | 100 | 99.99
 |  | 0.360 | 624 | 0.3 | 99.99 | 99.99 | 100 | 99.98
 |  | 0.397 | 382 | 0.132 | 99.99 | 99.99 | 99.99 | 99.99
 |  | 0.361 | 664 | 0.3 | 99.99 | 99.99 | 100 | 99.97

and extract the list of features with their corresponding im-

portance values. Then, we select only the features with an

importance higher than their average values, as presented in

Table 8.
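The selection rule described above (keep only the features whose importance exceeds the average importance) can be sketched as follows. This is a minimal illustration rather than the authors' code: the feature names and scores are hypothetical, and in practice the scores would presumably come from the trained model (for example, the `feature_importances_` attribute of an XGBoost scikit-learn style classifier).

```python
from statistics import mean


def select_above_mean(importances):
    """Return (feature, importance) pairs whose importance exceeds the mean importance."""
    threshold = mean(importances.values())
    selected = [(name, score) for name, score in importances.items() if score > threshold]
    # Sort by descending importance so the strongest features come first.
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Hypothetical importance scores, used only to illustrate the rule.
example_importances = {
    "feature_a": 0.40,
    "feature_b": 0.30,
    "feature_c": 0.15,
    "feature_d": 0.10,
    "feature_e": 0.05,
}

selected_features = select_above_mean(example_importances)
for name, score in selected_features:
    print(f"{name}: {score:.2f}")
```

With these illustrative scores the mean importance is 0.2, so only the first two features are retained.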

4.4. Anomaly Detection Results

In this section, we describe the architecture of our autoencoder models for each of the utilized datasets, namely the NSL-KDD Model, UNSW-NB15 Model, IoT-Zeek Model, and IoT-Zeek-Oversampled Model. We then present the results of each model and compare them with the state-of-the-art approaches presented in Yang et al. (2019).

NSL-KDD Model: After several iterations of model train-

ing, we found that the optimal anomaly detection model for

this dataset has ﬁve hidden layers: two for the encoder (128

and 64 neurons respectively), one layer for the bottleneck

(32 neurons), and two others for the decoder (64 and 128


Table 6

Selected Features on each Dataset using the Optimal Solution Hyper-parameters (acceleration constant 𝑐2 is fixed to 2 and 𝑐1 is tuned between 1 and 2).

| Dataset | Test size | Number of trees | Learning rate | Feature name | Feature importance |
|---|---|---|---|---|---|
| NSL-KDD | 0.106 | 680 | 0.1 | src_bytes | 0.298222400 |
| | | | | num_failed_logins | 0.131071240 |
| | | | | service | 0.074615410 |
| | | | | diff_srv_rate | 0.054890107 |
| | | | | flag | 0.039971426 |
| | | | | hot | 0.037307087 |
| | | | | count | 0.027289085 |
| | | | | dst_host_srv_diff_host_rate | 0.025930267 |
| | | | | dst_host_same_srv_rate | 0.024637770 |
| UNSW-NB15 | 0.1 | 1000 | 0.1 | sttl | 0.087299424 |
| | | | | ct_state_ttl | 0.059610307 |
| | | | | dsport | 0.018620330 |
| | | | | proto | 0.007498254 |
| IoT-Zeek Dataset | 0.369 | 510 | 0.3 | ts | 0.672937750 |
| | | | | id_orig_p | 0.121201570 |
| | | | | history | 0.110579970 |
| | | | | resp_ip_bytes | 0.089164086 |

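Table 7

Feature Selection Results on IoT-Zeek-Oversampled Dataset.

| Fitness function (f1 score) | c1 | Test size | #Trees | Accuracy (%) | f1 score (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|---|
| Random Forest (C2=2) | 2 | 0.4 | 478 | 99.99 | 99.99 | 99.99 | 99.99 |
| | | 0.4 | 518 | 99.99 | 99.99 | 99.99 | 99.99 |
| | | 0.4 | 534 | 99.99 | 99.99 | 99.99 | 99.99 |
| | 1 | 0.4 | 50 | 99.99 | 99.99 | 99.99 | 99.99 |
| | | 0.4 | 116 | 99.99 | 99.99 | 100 | 99.99 |
| | | 0.4 | 431 | 99.99 | 99.99 | 99.99 | 99.99 |
| (C2=2) | 2 | 0.4 | 50 | 100 | 100 | 100 | 100 |
| | | 0.3054 | 50 | 100 | 100 | 100 | 100 |
| | | 0.4 | 50 | 100 | 100 | 100 | 100 |
| | 1 | 0.4 | 492 | 99.99 | 99.99 | 99.99 | 99.99 |
| | | 0.3963 | 679 | 99.99 | 99.99 | 99.99 | 99.99 |
| | | 0.4 | 739 | 99.99 | 99.99 | 99.99 | 99.99 |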

respectively). In addition, we used two regularization techniques to mitigate overfitting, namely dropout (0.5) and an L2 kernel regularizer with a value of 0.001 at each layer (as shown in the first part of Table 9). Moreover, this autoencoder is fully connected: all layers are of the Dense layer type, and each of them uses ReLU as its activation function. The batch size is set to 32, and the test set size equals the optimal one found by the optimization algorithm (0.106). Additionally, we tuned the model with three different loss functions: categorical crossentropy, mean squared error, and mean absolute error. The results of validating this model with different thresholds are presented in Table 10. As can be seen, the model performs best with categorical crossentropy as the loss function, with an optimal threshold equal to 0.512, achieving an 𝑓1 score of approximately 92.09%.
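To make the role of the threshold concrete, the sketch below flags a sample as anomalous when its reconstruction error exceeds a threshold and then scores the result. The decision rule (error greater than threshold means anomaly), the error values, and the labels are assumptions made for illustration; the text above does not spell out the exact rule used.

```python
def flag_anomalies(errors, threshold):
    """Label a sample 1 (anomalous) when its reconstruction error exceeds the threshold."""
    return [1 if error > threshold else 0 for error in errors]


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and f1 for the anomalous class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical reconstruction errors and ground-truth labels (1 = anomaly).
errors = [0.10, 0.20, 0.60, 0.90, 0.45, 0.70]
labels = [0, 0, 1, 1, 0, 1]

# Sweep a few candidate thresholds, as one would when filling a results table.
for threshold in (0.3, 0.512, 0.8):
    predictions = flag_anomalies(errors, threshold)
    print(threshold, precision_recall_f1(labels, predictions))
```

A low threshold favours recall at the cost of precision, and a high threshold does the opposite, which is the trade-off visible across the threshold rows of the results table.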

UNSW-NB15 Model: For this dataset, after applying the same model tuning steps used for the NSL-KDD dataset, we found that the optimal model has exactly the same regularization values at each layer (dropout=0.5 and kernel_regularizer_l2=0.001). However, there are two differences compared to the previous model on the NSL-KDD dataset. First, with this dataset there are exactly seven hidden layers (encoder=[512,256,128], bottleneck=[64], and

Table 8

Selected Features on IoT-Zeek-Oversampled Dataset using the Optimal Solution Hyper-parameters.

| Test size | Number of trees | Learning rate | Feature name | Feature importance |
|---|---|---|---|---|
| 0.3905 | 50 | 0.2648 | ts | 0.254218453 |
| | | | resp_ip_bytes | 0.161528458 |
| | | | resp_pkts | 0.152892584 |
| | | | resp_bytes | 0.084122151 |
| | | | id_orig_p | 0.083250296 |


Figure 6: Deep Learning Anomaly Detection: Train and Validation Loss.

Figure 7: Autoencoder Anomaly Detection ROC Curves.

decoder=[128,256,512]), as shown in the second part of Table 9, with the batch size set to 64. Second, this model, contrary to the previous one, performs best with the mean squared error loss function, reaching an 𝑓1 score of 92.904% at an optimal threshold of 2.239 (as shown in Table 10).
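To give a feel for the relative size of the two architectures, the following sketch counts the trainable weights and biases of a fully connected autoencoder given its hidden layer widths. The hidden widths are the ones stated for the NSL-KDD and UNSW-NB15 models; the input dimension of 9 and the assumption that the output layer has as many units as the input are illustrative choices, not values taken from the paper.

```python
def dense_parameter_count(input_dim, hidden_units):
    """Count weights and biases of dense layers mapping input_dim -> hidden_units -> input_dim."""
    sizes = [input_dim, *hidden_units, input_dim]
    # Each dense layer has n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


NSL_KDD_HIDDEN = [128, 64, 32, 64, 128]                # five hidden layers
UNSW_NB15_HIDDEN = [512, 256, 128, 64, 128, 256, 512]  # seven hidden layers
INPUT_DIM = 9  # hypothetical number of input features after selection and encoding

nsl_kdd_params = dense_parameter_count(INPUT_DIM, NSL_KDD_HIDDEN)
unsw_nb15_params = dense_parameter_count(INPUT_DIM, UNSW_NB15_HIDDEN)
print("NSL-KDD style autoencoder parameters:", nsl_kdd_params)
print("UNSW-NB15 style autoencoder parameters:", unsw_nb15_params)
```

Under these assumptions the seven-layer model has roughly fifteen times as many parameters as the five-layer one, which is consistent with its noticeably longer training time.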

IoT-Zeek Model: We achieve an overall average of 𝑓1score

equal to 97.302 on this dataset (as shown in Table 10) using

mean squared error loss function. The autoencoder model

for this dataset is the same as the one for NSL-KDD, except

that we give it a smaller value for the kernel regularization

function (0.0001), as shown in the third part of Table 9.
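As a quick consistency check, the 𝑓1 score is the harmonic mean of precision and recall, so the headline scores can be recomputed from the precision and recall reported for the best threshold of each model in Table 10. The short sketch below does this; small differences in the third decimal are expected because the published precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (result is in the same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported f1 %) at the best threshold of each model in Table 10.
reported = {
    "NSL-KDD": (89.351, 95.005, 92.092),
    "UNSW-NB15": (90.00, 96.002, 92.904),
    "IoT-Zeek": (95.659, 99.002, 97.302),
}

for name, (precision, recall, reported_f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed {recomputed:.3f}, reported {reported_f1:.3f}")
```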

IoT-Zeek-Oversampled Model: We apply the same hyper-

parameters that were tuned for the NSL-KDD Model as

shown in Table 9, and train our autoencoder model on the

IoT-Zeek-Oversampled dataset using mean squared error

loss function. The ROC curve is shown in Figure 8, where

the model achieves 99% area under the curve (AUC). More-

over, obtained results for diﬀerent threshold values are pre-

sented in Table 10. As seen, we achieve an 𝑓1score of

94.300 on the oversampled data, while the obtained 𝑓1score

on the original data was 97.302. The drop of about 3 percentage points in the 𝑓1 score can be explained by the selected features and their importance before and after oversampling, as presented in Table 6 and Table 8, respectively. Since the feature selection technique applies statistical methods to find the best features, if the population (the dataset) changes (due to over- or under-sampling), the importance of the selected features will most likely be different, which affects the overall performance.

Figure 8: Autoencoder Anomaly Detection ROC Curve on IoT-

Zeek-Oversampled Dataset.

We further measure the training and validation loss for

each of the aforementioned models as presented in Figure 6;

where we can see that for all three models, there is no over-

ﬁtting and each model’s loss becomes stable around epoch

6. Moreover, Figure 7 shows the Receiver Operating Char-

acteristic (ROC) curve for each one of the aforementioned

models (NSL-KDD Model, UNSW-NB15 Model, and IoT-

Zeek Model explained in subsection 4.4), where all the three

models achieve more than 90% area under the curve (AUC),

with IoT-Zeek Model achieving almost 100% AUC.
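For readers less familiar with the AUC figure quoted above, it can be read as the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one. The sketch below computes it directly from that definition; the scores and labels are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical anomaly scores (e.g., reconstruction errors) and labels (1 = anomaly).
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(example_labels, example_scores))
```

An AUC close to 1 therefore means that almost every anomalous sample is ranked above almost every normal sample, independently of the particular threshold chosen.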

4.5. Comparative Study

We further compare the two models that we trained on the NSL-KDD and UNSW-NB15 benchmark datasets with the most prominent recent state-of-the-art anomaly detection models (e.g., the ones presented in Yang et al. (2019)) applied to these datasets. According to our experiments, our proposed autoencoders outperform them in terms of the 𝑓1 score metric. The results of this comparison are depicted in Figure 9a and Figure 9b for NSL-KDD and UNSW-NB15, respectively, where, for both datasets, our proposed models achieve the highest 𝑓1 score values. Moreover, we compare the performance of our proposed approach with the aforementioned selected work in terms of the accuracy metric on


Table 9

Proposed Autoencoder Architecture by Dataset.

| Dataset | Encoder | Bottleneck | Decoder | Regularizations |
|---|---|---|---|---|
| NSL-KDD | layer 1: 128 neurons; layer 2: 64 neurons | layer 3: 32 neurons | layer 4: 64 neurons; layer 5: 128 neurons | Dropout: 0.5; L2-regularizer: 0.001 |
| UNSW-NB15 | layer 1: 512 neurons; layer 2: 256 neurons; layer 3: 128 neurons | layer 4: 64 neurons | layer 5: 128 neurons; layer 6: 256 neurons; layer 7: 512 neurons | Dropout: 0.5; L2-regularizer: 0.001 |
| IoT-Zeek | layer 1: 128 neurons; layer 2: 64 neurons | layer 3: 32 neurons | layer 4: 64 neurons; layer 5: 128 neurons | Dropout: 0.5; L2-regularizer: 0.0001 |
| IoT-Zeek-Oversampled | layer 1: 128 neurons; layer 2: 64 neurons | layer 3: 32 neurons | layer 4: 64 neurons; layer 5: 128 neurons | Dropout: 0.5; L2-regularizer: 0.001 |

Table 10

Chameleon Deep Learning Anomaly Detection Results.

| Dataset | Loss function | Training time | Threshold | Accuracy (%) | Precision (%) | Recall (%) | f1 score (%) |
|---|---|---|---|---|---|---|---|
| NSL-KDD | categorical crossentropy | 6 min 13 s | 0.105 | 86.191 | 81.481 | 98.021 | 88.989 |
| | | | 0.087 | 86.072 | 80.618 | 99.439 | 89.045 |
| | | | 0.177 | 87.833 | 84.065 | 97.016 | 90.077 |
| | | | 2.190 | 89.092 | 90.753 | 90.010 | 90.380 |
| | | | 1.758 | 89.532 | 90.640 | 91.008 | 90.824 |
| | | | 1.018 | 89.607 | 89.965 | 92.005 | 90.974 |
| | | | 0.837 | 90.073 | 89.900 | 93.010 | 91.429 |
| | | | 0.314 | 90.006 | 87.624 | 96.002 | 91.622 |
| | | | 0.742 | 90.592 | 89.922 | 94.008 | 91.920 |
| | | | 0.512 | 90.711 | 89.351 | 95.005 | 92.092 |
| UNSW-NB15 | mean squared error | 28 min 38 s | 1.352 | 84.382 | 79.346 | 98.099 | 87.731 |
| | | | 1.342 | 84.302 | 78.795 | 99.088 | 87.784 |
| | | | 2.365 | 86.058 | 86.124 | 90.010 | 88.024 |
| | | | 2.346 | 86.387 | 85.913 | 91.008 | 88.387 |
| | | | 2.325 | 86.728 | 85.705 | 92.036 | 88.758 |
| | | | 2.305 | 87.083 | 85.546 | 93.026 | 89.129 |
| | | | 2.286 | 87.456 | 85.406 | 94.031 | 89.511 |
| | | | 2.265 | 87.757 | 85.169 | 95.044 | 89.836 |
| | | | 2.162 | 87.629 | 83.805 | 97.016 | 89.927 |
| | | | 2.239 | 89.523 | 90.00 | 96.002 | 92.904 |
| IoT-Zeek | mean squared error | 8 min 32 s | 3.101 | 98.158 | 96.344 | 90.004 | 93.066 |
| | | | 3.098 | 98.288 | 96.329 | 91.002 | 93.590 |
| | | | 3.097 | 98.407 | 96.235 | 92.001 | 94.070 |
| | | | 3.094 | 98.530 | 96.157 | 93.012 | 94.558 |
| | | | 3.092 | 98.651 | 96.080 | 94.010 | 95.034 |
| | | | 3.090 | 98.777 | 96.043 | 95.009 | 95.523 |
| | | | 3.088 | 98.894 | 95.944 | 96.007 | 95.975 |
| | | | 3.086 | 99.019 | 95.897 | 97.005 | 96.448 |
| | | | 3.084 | 99.134 | 95.789 | 98.003 | 96.884 |
| | | | 3.082 | 99.246 | 95.659 | 99.002 | 97.302 |
| IoT-Zeek-Oversampled | mean squared error | | 95.921 | 97.321 | 90.444 | 90.004 | 90.223 |
| | | | 90.063 | 97.458 | 90.538 | 91.002 | 90.770 |
| | | | 89.015 | 97.595 | 90.631 | 92.001 | 91.311 |
| | | | 87.177 | 97.734 | 90.724 | 93.012 | 91.854 |
| | | | 85.143 | 97.871 | 90.813 | 94.010 | 92.384 |
| | | | 3.364 | 97.817 | 86.922 | 99.002 | 92.569 |
| | | | 82.043 | 97.979 | 90.719 | 95.009 | 92.814 |
| | | | 78.225 | 98.099 | 90.694 | 96.00 | 93.275 |
| | | | 78.037 | 98.236 | 90.781 | 97.005 | 93.790 |
| | | | 76.116 | 98.373 | 90.866 | 98.003 | 94.300 |

both datasets. The results of this comparative study are reported in Table 11 for both the NSL-KDD and UNSW-NB15 datasets.

More specifically, we compare the accuracy results obtained during testing (using a hold-out dataset) of our autoencoder models (NSL-KDD Model and UNSW-NB15 Model) trained on both benchmark datasets with the reported accuracy results of existing state-of-the-art models tested on the NSL-KDD dataset (e.g., Yang et al. (2019); Ma et al. (2016); Javaid et al. (2016); Tang et al. (2016); Imamverdiyev and


(a) NSL-KDD Dataset. (b) UNSW-NB15 Dataset.

Figure 9: A Comparative Study of Anomaly Detection Approaches in terms of 𝑓1score.

Table 11: A Comparative Study of Anomaly Detection Approaches in terms of Accuracy.

| Approach | NSL-KDD Dataset | UNSW-NB15 Dataset | Average Accuracy |
|---|---|---|---|
| Chameleon⁹ | 90.71% | 89.52% | 90.115% |
| Rashid et al. (2022) | 99.90% | 94.00% | 96.95% |
| MemAE Min et al. (2021) | 89.51% | 85.30% | 87.405% |
| Roy et al. (2022) | 98.50% | Not Reported | - |
| CNN Mahalakshmi et al. (2021) | Not Reported | 93.50% | - |
| J48 Roy and Singh (2021) | Not Reported | 87.65% | - |
| ICVAE-DNN Yang et al. (2019) | 85.97% | 89.08% | 87.525% |
| GB-RBM Imamverdiyev and Abdullayeva (2018) | 73.23% | Not Reported | - |
| RNN-IDS Yin et al. (2017) | 81.29% | Not Reported | - |
| ID-CVAE Lopez-Martin et al. (2017) | 80.10% | Not Reported | - |
| CASCADE-ANN Baig et al. (2017) | Not Reported | 86.40% | - |
| DNN Tang et al. (2016) | 75.75% | Not Reported | - |
| STL Javaid et al. (2016) | 74.38% | Not Reported | - |
| SCDNN Ma et al. (2016) | 72.64% | Not Reported | - |
| DT Moustafa and Slay (2016b) | Not Reported | 85.56% | - |
| EM Clustering Moustafa and Slay (2016b) | Not Reported | 78.47% | - |

Abdullayeva (2018); Yin et al. (2017); Lopez-Martin et al. (2017); Min et al. (2021)) and on the UNSW-NB15 dataset (e.g., Yang et al. (2019); Baig et al. (2017); Moustafa and Slay (2016b); Roy and Singh (2021); Mahalakshmi et al. (2021); Min et al. (2021)), as presented in Table 11.
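The "Average Accuracy" column of the accuracy comparison is simply the arithmetic mean of the two per-dataset accuracies, and it is left empty when an approach reports only one dataset. A minimal sketch of that computation, using the per-dataset values from the comparison (the helper name and the use of `None` for "Not Reported" are illustrative choices):

```python
def average_accuracy(nsl_kdd, unsw_nb15):
    """Mean of the two accuracies, or None when either dataset is not reported."""
    if nsl_kdd is None or unsw_nb15 is None:
        return None
    return (nsl_kdd + unsw_nb15) / 2


# (NSL-KDD accuracy %, UNSW-NB15 accuracy %); None stands for "Not Reported".
accuracies = {
    "Chameleon": (90.71, 89.52),
    "Rashid et al. (2022)": (99.90, 94.00),
    "MemAE Min et al. (2021)": (89.51, 85.30),
    "ICVAE-DNN Yang et al. (2019)": (85.97, 89.08),
    "Roy et al. (2022)": (98.50, None),
}

for approach, (nsl, unsw) in accuracies.items():
    avg = average_accuracy(nsl, unsw)
    print(approach, "-" if avg is None else f"{avg:.3f}%")
```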

In Roy and Singh (2021), the authors examine different anomaly detection classifiers on the UNSW-NB15 dataset before and after applying feature selection. The reported results show that the J48 classifier achieves the highest accuracy of 87.65%, slightly outperforming the case where no feature selection is applied (with an accuracy of 87.44%). However, the authors have not measured other performance evaluation metrics (e.g., f1 score, recall, and precision). The authors of Mahalakshmi et al. (2021) apply a CNN model to the UNSW-NB15 dataset and detect anomalies with an accuracy of 93.5%. However, they have not examined their approach on the NSL-KDD dataset. Moreover, they have not reported additional performance metrics (e.g., f1 score, recall, and precision) in their evaluation.

In Min et al. (2021), the authors introduced MemAE by

using autoencoders to reconstruct the behavior of abnormal

samples that look close to normal ones. They achieve an

accuracy of 89.51% and f1-score of 89.93% on NSL-KDD

dataset, as well as 85.3% accuracy and 85.26% f1-score on

UNSW-NB15 dataset. However, the authors have not con-

sidered using feature selection prior to their autoencoder

anomaly detection model to show the diﬀerence between the

two scenarios.

Roy et al. (2022) propose B-Stacking, a lightweight supervised intrusion detection approach based on a machine learning ensemble that uses K-Nearest Neighbors (KNN), Random Forest, and XGBoost to detect network anomalies. The approach has been evaluated only on the NSL-KDD dataset, with an accuracy of 98.50% and an f1 score of 99.00%, and has not been tested on the UNSW-NB15 dataset. The authors in Rashid et al. (2022) propose a stacking ensemble technique (SET) with the SelectKBest feature selection technique and an ensemble of Decision Trees, Random Forest, and XGBoost machine learning models for network anomaly detection. Their experiments demonstrate that SET obtains an accuracy and 𝑓1 score of 94.00% each on the UNSW-NB15 dataset, and 99.90% for both accuracy and 𝑓1 score on the NSL-KDD dataset. However, the SelectKBest technique is less adaptive to new malicious network traffic over time. In contrast, our proposed solution, CHAMELEON, employs an autoencoder model, which is more resilient to new threats since it uses unsupervised techniques. Moreover, CHAMELEON has two sub-detection modules: supervised (XGBoost and Random Forest for classification) and unsupervised (deep autoencoders for anomaly detection), both of which achieve promising accuracy results. Although our main objective is to perform anomaly detection, the classification results presented in Table 5 and Table 7 demonstrate high 𝑓1 scores, which outperform


the reviewed existing approaches. The results reported in Table 10, obtained from our autoencoder-based anomaly detection approach, are considered high in the context of (unsupervised) anomaly detection.

Amongst the aforementioned works, Yang et al. (2019) deploy a combination of variational autoencoders and deep neural networks (DNN) to detect anomalies, which achieves the highest accuracy and 𝑓1 score of 85.97% and 86.27% on NSL-KDD, and of 89.08% and 90.61% on UNSW-NB15. Ma et al. (2016) combine spectral clustering and DNN, achieving 72.64% accuracy on NSL-KDD, and Javaid et al. (2016) deploy self-taught learning, reporting an accuracy of 74.38% on NSL-KDD. Tang et al. (2016) employ DNN and obtain 75.75% accuracy on NSL-KDD. Imamverdiyev and Abdullayeva (2018) deploy a Gaussian-Bernoulli based Restricted Boltzmann Machine, achieving 73.23% accuracy on NSL-KDD, while Yin et al. (2017) propose a novel IDS using recurrent neural networks (RNNs), reporting 81.29% accuracy on NSL-KDD. On the other hand, Lopez-Martin et al. (2017) propose an intrusion detection system based on conditional variational autoencoders (CVAE), achieving 80.1% accuracy on the NSL-KDD dataset. Baig et al. (2017) introduce a novel approach for intrusion detection using multi-cascading artificial neural networks, achieving an accuracy of 86.4% on the UNSW-NB15 dataset. Moustafa and Slay (2016b) deploy two approaches: the first uses an expectation-maximization clustering technique to detect anomalies, achieving 78.47% accuracy on the UNSW-NB15 dataset, and the second deploys decision trees on the same dataset and records an accuracy of 85.56%. Given all this information, we notice that our work outperforms the aforementioned state-of-the-art works by achieving 90.711% accuracy and a 92.092% f1 score on the NSL-KDD dataset, and 89.523% accuracy and a 92.904% f1 score on the UNSW-NB15 dataset.

The advantages of our approach over aforementioned ex-

isting works are as follows:

• Feature selection: our work is amongst the few proposed approaches (e.g., Roy and Singh (2021)) that select the most important features, here through the PSO algorithm, before applying a detection model. Feature selection filters unimportant or irrelevant features (noisy data) out of the dataset, which allows each class/label to be classified more accurately and results in more accurate models. In addition, feature selection provides better efficiency and scalability compared to existing models that use all the features of the datasets.

• Evaluation on a recent real-world IoT dataset: while existing works evaluated their approaches on the most common benchmark datasets (NSL-KDD and UNSW-NB15), none of them conducted experiments on a real-world IoT dataset. In contrast, we first generate our own real-world IoT dataset, and then apply our models to that dataset in addition to the non-IoT datasets. This makes our approach more realistic and applicable to recent security problems.

4.6. Eﬃciency

In this section, we examine the execution time of the optimized feature selection algorithm depending on the chosen fitness function. The obtained results are presented in Figure 10. The execution time is relatively high for the UNSW-NB15 dataset, due to its huge number of records as well as its large number of features. However, we do not consider this an issue, since the optimized feature selection task is executed only once on each dataset.

Figure 10: Optimized feature selection execution times.

5. Concluding Remarks and Limitations

Optimization of non-deterministic tasks in machine

learning and deep learning is becoming a new widespread

approach to help developers ﬁnd optimal hyper-parameter

settings and use them to build their classiﬁcation, regression,

or clustering models. This paper presented a novel approach

which focuses on ﬁnding the optimal hyper-parameters for

ensemble methods in order to select the important features

on a given networking dataset. The proposed approach is developed by combining ensemble methods with a swarm intelligence optimization algorithm (PSO). Our validation results show that the proposed algorithm finds better solutions when tuned with boosting (XGBoost) ensemble techniques than with bagging (Random Forest) ones.

Moreover, we used the optimal solutions found by the optimization algorithm to select the appropriate set of features on each validation dataset. Using only those features, we built and tuned an anomaly detection autoencoder for each of these datasets. The evaluation results demonstrate that our anomaly detection models outperform the most efficient state-of-the-art techniques applied to these datasets. Additionally, they achieve reasonable training times.

However, there are some limitations to our work that need to be addressed in the future. First, we used only two hyper-parameters for the optimization algorithm when using Random Forests (number of trees and test size), and three when using it with XGBoost (number of trees, test size, and learning rate). We are currently exploring the possibility of optimizing more hyper-parameters. In addition, we need to improve the scalability (execution times) of the feature selection (optimization) algorithm, although this does not pose a major issue, since the algorithm needs to be run only once for each dataset rather than on a regular basis. Furthermore, we have not explored setting the PSO hyper-parameters (𝑐1, 𝑐2, and 𝑤) in an adaptive fashion, which can also improve the search efficiency; this involves the use of variations of PSO, such as Adaptive Particle Swarm Optimization (APSO) Zhan et al. (2009), to find the optimal settings for these three hyper-parameters.
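For reference, the role of 𝑐1, 𝑐2, and 𝑤 is easiest to see in the textbook PSO update with an inertia weight, where 𝑤 scales the previous velocity, 𝑐1 weights the pull towards the particle's own best position, and 𝑐2 weights the pull towards the swarm's best position. The sketch below implements one such update for a single particle. It is the canonical form, not necessarily the exact variant used in this work, and the example position and the value of 𝑤 are hypothetical.

```python
import random


def pso_step(position, velocity, personal_best, global_best, w, c1, c2, rng=random):
    """Apply one canonical PSO velocity and position update to a single particle."""
    new_position, new_velocity = [], []
    for x, v, p_best, g_best in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        v_next = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


# Hypothetical particle encoding (test size, number of trees); w = 0.5 is an assumed value.
position, velocity = [0.20, 300.0], [0.0, 0.0]
personal_best, global_best = [0.15, 250.0], [0.10, 680.0]
position, velocity = pso_step(position, velocity, personal_best, global_best,
                              w=0.5, c1=2.0, c2=2.0)
print(position, velocity)
```

An adaptive scheme such as APSO would adjust `w`, `c1`, and `c2` between iterations instead of keeping them fixed as in this sketch.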

References

Ahmad, A., Khan, M., Paul, A., Din, S., Rathore, M.M., Jeon, G., Choi,

G.S., 2018. Toward modeling and optimization of features selection in

big data based social internet of things. Future Generation Computer

Systems 82, 715–726.

Ahmed, M., Mahmood, A.N., Hu, J., 2016. A survey of network anomaly

detection techniques. Journal of Network and Computer Applications

60, 19–31.

Ali, W., Malebary, S.J., 2020. Particle swarm optimization-based feature

weighting for improving intelligent phishing website detection. IEEE

Access 8, 116766–116780.

Alsaheel, A., Nan, Y., Ma, S., Yu, L., Walkup, G., Celik, Z.B., Zhang,

X., Xu, D., 2021. ATLAS: A sequence-based learning approach for

attack investigation, in: 30th USENIX Security Symposium (USENIX

Security 21).

Baig, M.M., Awais, M.M., El-Alfy, E.S.M., 2017. A multiclass cascade

of artiﬁcial neural network for network intrusion detection. Journal of

Intelligent & Fuzzy Systems 32, 2875–2883.

Bühlmann, P., 2012. Bagging, boosting and ensemble methods, in: Hand-

book of Computational Statistics. Springer, pp. 985–1022.

Chalapathy, R., Chawla, S., 2019. Deep learning for anomaly detection: A

survey. arXiv preprint arXiv:1901.03407 .

Chalapathy, R., Khoa, N.L.D., Chawla, S., 2020. Robust deep learn-

ing methods for anomaly detection, in: Proceedings of the 26th ACM

SIGKDD International Conference on Knowledge Discovery & Data

Mining (KDD’20), pp. 3507–3508.

Doan, M., Zhang, Z., 2020. Deep learning in 5G wireless networks-

anomaly detections, in: 29th Wireless and Optical Communications

Conference (WOCC’20), IEEE. pp. 1–6.

Dong, H., Li, T., Ding, R., Sun, J., 2018. A novel hybrid genetic algo-

rithm with granular information for feature selection and optimization.

Applied Soft Computing 65, 33–46.

Du, M., Li, F., Zheng, G., Srikumar, V., 2017. DeepLog: Anomaly detec-

tion and diagnosis from system logs through deep learning, in: Proceed-

ings of the 2017 ACM SIGSAC Conference on Computer and Commu-

nications Security (CCS’17), pp. 1285–1298.

Dutta, V., Choraś, M., Pawlicki, M., Kozik, R., 2020. A deep learning

ensemble for network anomaly and cyber-attack detection. Sensors 20,

4583.

Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera,

F., 2018. Learning from imbalanced data sets. volume 10. Springer.

Ghamisi, P., Benediktsson, J.A., 2014. Feature selection based on hy-

bridization of genetic algorithm and particle swarm optimization. IEEE

Geoscience and Remote Sensing Letters (GRSL) 12, 309–313.

Gomes, H.M., Barddal, J.P., Enembreck, F., Bifet, A., 2017. A survey on

ensemble learning for data stream classiﬁcation. ACM Computing Sur-

veys (CSUR) 50, 1–36.

Hamamoto, A.H., Carvalho, L.F., Sampaio, L.D.H., Abrão, T., Proença Jr,

M.L., 2018. Network anomaly detection system using genetic algorithm

and fuzzy logic. Expert Systems with Applications 92, 390–402.

Hartmann, W.M., 2004. Dimension reduction vs. variable selection, in:

International Workshop on Applied Parallel Computing (PARA’04),

Springer. pp. 931–938.

Hwang, R.H., Peng, M.C., Huang, C.W., Lin, P.C., Nguyen, V.L., 2020.

An unsupervised deep learning model for early network traﬃc anomaly

detection. IEEE Access 8, 30387–30399.

Imamverdiyev, Y., Abdullayeva, F., 2018. Deep learning method for denial

of service attack detection based on restricted boltzmann machine. Big

data 6, 159–169.

Javaid, A., Niyaz, Q., Sun, W., Alam, M., 2016. A deep learning approach

for network intrusion detection system, in: Proceedings of the 9th EAI

International Conference on Bio-inspired Information and Communica-

tions Technologies (formerly BIONETICS), pp. 21–26.

Jia, W.J., Zhang, Y.D., 2018. Survey on theories and methods of autoen-

coder. Computer Systems & Applications 5, 1.

Kennedy, J., Eberhart, R., 1995. Particle swarm optimization, in: Proceed-

ings of 95-International Conference on Neural Networks (ICNN), IEEE.

pp. 1942–1948.

Kwon, D., Kim, H., Kim, J., Suh, S.C., Kim, I., Kim, K.J., 2019. A survey

of deep learning-based network anomaly detection. Cluster Computing

22, 949–961.

Laboratory, S.R., 2018. Malware public datasets. URL: https://mcfp.felk.cvut.cz/publicDatasets/.

Lauzon, F.Q., 2012. An introduction to deep learning, in: 2012 11th In-

ternational Conference on Information Science, Signal Processing and

their Applications (ISSPA), IEEE. pp. 1438–1439.

Lazar, C., Taminau, J., Meganck, S., Steenhoﬀ, D., Coletta, A., Molter,

C., de Schaetzen, V., Duque, R., Bersini, H., Nowe, A., 2012. A sur-

vey on ﬁlter techniques for feature selection in gene expression microar-

ray analysis. IEEE/ACM Transactions on Computational Biology and

Bioinformatics 9, 1106–1119.

Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., Alsaadi, F.E., 2017. A survey

of deep neural network architectures and their applications. Neurocom-

puting 234, 11–26.

Liu, Y., Wang, G., Chen, H., Dong, H., Zhu, X., Wang, S., 2011. An

improved particle swarm optimization for feature selection. Journal of

Bionic Engineering 8, 191–200.

Lopez-Martin, M., Carro, B., Sanchez-Esguevillas, A., Lloret, J., 2017.

Conditional variational autoencoder for prediction and feature recovery

applied to intrusion detection in iot. Sensors 17, 1967.

Ma, Q., Sun, C., Cui, B., Jin, X., 2021. A novel model for anomaly detection

in network traﬃc based on kernel support vector machine. Computers

& Security 104, 102215.

Ma, T., Wang, F., Cheng, J., Yu, Y., Chen, X., 2016. A hybrid spectral

clustering and deep neural network ensemble algorithm for intrusion de-

tection in sensor networks. Sensors 16, 1701.

Mahalakshmi, G., Uma, E., Aroosiya, M., Vinitha, M., 2021. Intrusion detection system using convolutional neural network on UNSW NB15 dataset, in: Advances in Parallel Computing Technologies and Applications. IOS Press, pp. 1–8.

Marini, F., Walczak, B., 2015. Particle swarm optimization (PSO). A tuto-

rial. Chemometrics and Intelligent Laboratory Systems 149, 153–165.

Merrill, N., Eskandarian, A., 2020. Modiﬁed autoencoder training and scor-

ing for robust unsupervised anomaly detection in deep learning. IEEE

Access 8, 101824–101833.

Min, B., Yoo, J., Kim, S., Shin, D., Shin, D., 2021. Network anomaly

detection using memory-augmented deep autoencoder. IEEE Access 9,

104695–104706.

Moustafa, N., Creech, G., Slay, J., 2017. Big data analytics for intrusion de-

tection system: Statistical decision-making using Finite Dirichlet mix-

ture models, in: Data analytics and decision support for cybersecurity:

Trends, Methodologies and Applications. Springer, pp. 127–156.

Moustafa, N., Slay, J., 2015. UNSW-NB15: a comprehensive data set for

network intrusion detection systems (UNSW-NB15 network data set),

in: 2015 Military Communications and Information Systems Confer-

ence (MilCIS), IEEE. pp. 1–6.

Moustafa, N., Slay, J., 2016a. The evaluation of network anomaly detec-

tion systems: statistical analysis of the UNSW-NB15 data set and the

comparison with the KDD99 data set. Information Security Journal: A


Global Perspective 25, 18–31.

Moustafa, N., Slay, J., 2016b. The evaluation of network anomaly detec-

tion systems: Statistical analysis of the UNSW-NB15 data set and the

comparison with the KDD99 data set. Information Security Journal: A

Global Perspective 25, 18–31.

Moustafa, N., Slay, J., Creech, G., 2019. Novel geometric area analysis

technique for anomaly detection using trapezoidal area estimation on

large-scale networks. IEEE Transactions on Big Data 5, 481–494.

NASA AVIRIS Sensor, 2021. Indian Pines dataset. URL: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Indian_Pines.

Nkenyereye, L., Tama, B.A., Lim, S., 2021. A stacking-based deep neu-

ral network approach for eﬀective network anomaly detection. CMC-

COMPUTERS MATERIALS & CONTINUA 66, 2217–2227.

Oreski, S., Oreski, G., 2014. Genetic algorithm-based heuristic for feature

selection in credit risk assessment. Expert Systems with Applications

41, 2052–2064.

Paxson, V., 1999. Bro: a system for detecting network intruders in real-

time. Computer networks 31, 2435–2463.

Rashid, M., Kamruzzaman, J., Imam, T., Wibowo, S., Gordon, S., 2022.

A tree-based stacking ensemble technique with feature selection for net-

work intrusion detection. Applied Intelligence , 1–14.

Roy, A., Singh, K.J., 2021. Multi-classiﬁcation of UNSW-NB15 dataset

for network anomaly detection system, in: Proceedings of Interna-

tional Conference on Communication and Computational Technologies,

Springer. pp. 429–451.

Roy, S., Li, J., Choi, B.J., Bai, Y., 2022. A lightweight supervised intru-

sion detection mechanism for iot networks. Future Generation Computer

Systems 127, 276–285.

Sagi, O., Rokach, L., 2018. Ensemble learning: A survey. Wiley Interdis-

ciplinary Reviews: Data Mining and Knowledge Discovery 8, e1249.

Sheikhpour, R., Sarram, M.A., Gharaghani, S., Chahooki, M.A.Z., 2017.

A survey on semi-supervised feature selection methods. Pattern Recog-

nition 64, 141–158.

Shen, Y., Stringhini, G., 2019. Attack2vec: Leveraging temporal word em-

beddings to understand the evolution of cyberattacks, in: 28th USENIX

Security Symposium (USENIX Security 19), pp. 905–921.

Tama, B.A., Nkenyereye, L., Islam, S.R., Kwak, K.S., 2020. An enhanced

anomaly detection in web traﬃc using a stack of classiﬁer ensemble.

IEEE Access 8, 24120–24134.

Tang, T.A., Mhamdi, L., McLernon, D., Zaidi, S.A.R., Ghogho, M., 2016.

Deep learning approach for network intrusion detection in software de-

ﬁned networking, in: 2016 international conference on wireless net-

works and mobile communications (WINCOM), IEEE. pp. 258–263.

Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A., 2009. A detailed analy-

sis of the KDD CUP 99 data set, in: IEEE symposium on Computational

Intelligence for Security and Defense Applications (CISDA’09), IEEE.

pp. 1–6.

Team, Z., 2018. Zeek an open source network security monitoring tool.

URL: https://zeek.org/.

Xie, M., Han, S., Tian, B., Parvin, S., 2011. Anomaly detection in wireless

sensor networks: A survey. Journal of Network and computer Applica-

tions 34, 1302–1325.

Xiong, P., Cui, B., Cheng, Z., 2020. Anomaly network traﬃc detection

based on deep transfer learning, in: International Conference on Innova-

tive Mobile and Internet Services in Ubiquitous Computing (IMIS’20),

Springer. pp. 384–393.

Xue, B., Zhang, M., Browne, W.N., 2012. Particle swarm optimization for

feature selection in classiﬁcation: A multi-objective approach. IEEE

Transactions on Cybernetics 43, 1656–1671.

Yang, Y., Zheng, K., Wu, C., Yang, Y., 2019. Improving the classiﬁca-

tion eﬀectiveness of intrusion detection by using improved conditional

variational autoencoder and deep neural network. Sensors 19, 2528.

Yin, C., Zhu, Y., Fei, J., He, X., 2017. A deep learning approach for intru-

sion detection using recurrent neural networks. IEEE Access 5, 21954–

21961.

Zhan, Z.H., Zhang, J., Li, Y., Chung, H.S.H., 2009. Adaptive particle

swarm optimization. IEEE Transactions on Systems, Man, and Cyber-

netics, Part B (Cybernetics) 39, 1362–1381.
