All content in this area was uploaded by Paria Shirani on Oct 09, 2022
CHAMELEON: Optimized Feature Selection using Particle Swarm
Optimization and Ensemble Methods for Network Anomaly Detection
Aniss Chohra (a,∗), Paria Shirani (b,∗∗), ElMouatez Billah Karbab (a) and Mourad Debbabi (a)
(a) Security Research Centre, Gina Cody School of Engineering and Computer Science, Concordia University, Montréal, Québec, Canada
(b) Department of Computer Science, Ryerson University, Toronto, Ontario, Canada
ARTICLE INFO
Keywords:
Feature Selection
Swarm Intelligence
Particle Swarm Optimization (PSO)
Ensemble Methods
Internet of Things (IoT)
Network Anomaly Detection
Deep Learning
ABSTRACT
In this paper, we propose an optimization approach by leveraging swarm intelligence and ensemble
methods to solve the non-deterministic feature selection problem. The proposed approach is validated
on two benchmark datasets, namely, NSL-KDD and UNSW-NB15, in addition to a third dataset,
called IoT-Zeek dataset, which consists of Zeek network-based intrusion detection connection logs.
We build the IoT-Zeek dataset by employing ensemble classification and deep learning models using
publicly available malicious and benign threat intelligence on the Zeek connection logs of IoT devices.
Moreover, we deploy and validate a deep learning-based anomaly detection model using autoencoders
on each of the aforementioned datasets by utilizing the selected features obtained from the proposed
optimization approach. The obtained results demonstrate that our approach outperforms the existing
state-of-the-art machine learning models in terms of 𝑓1-score, achieving 92.092% on the NSL-KDD
dataset, 92.904% on the UNSW-NB15 dataset, and 97.302% on the IoT-Zeek dataset.
1. Introduction
Due to emerging technologies, the extensive connectivity
between devices in different ecosystems, and the
increasing rate of cyberattacks (e.g., IoT attacks increased
by 700% in the last two years¹), security analysis of network
data is an absolute need. However, providing accurate
and efficient threat detection on large volumes of
data is becoming more challenging. Meanwhile, during
the last decade, machine learning and deep learning techniques
have attracted tremendous attention in many fields
(e.g., anomaly detection, vulnerability assessment, natural
language processing, stock market prediction, and weather forecasting).
Therefore, training efficient and scalable machine learning
and deep learning-based threat detection models has become a
task of paramount importance.
There are two well-known problems that need to
be addressed to provide efficient, accurate, and scalable models.
(i) Selecting the appropriate hyper-parameter settings
for the model to be trained: this task generally falls in the
class of non-deterministic problems, as several different
settings may yield the same accuracy; that is, the problem
admits at least two optimal solutions.
(ii) Selecting the set of features
that best defines the problem at hand. Most domains offer a large
number of features, which makes it time-consuming to train and
validate the models. Moreover, some
of those features are irrelevant due to redundancy,
sparsity, or lack of correlation with the problem to be
solved. Therefore, filtering out
irrelevant features has become a widely adopted procedure
before any model training and experimentation.
∗ Corresponding author.
∗∗ Part of this work was done during the postdoctoral fellowship of
the author at Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
a_chohra@encs.concordia.ca (Aniss Chohra);
paria.shirani@ryerson.ca (Paria Shirani); e_karbab@encs.concordia.ca
(ElMouatez Billah Karbab); debbabi@ciise.concordia.ca (Mourad Debbabi)
ORCID(s): 0000-0003-1823-713X (Aniss Chohra); 0000-0001-5592-1518
(Paria Shirani); 0000-0003-1293-8314 (ElMouatez Billah Karbab);
0000-0003-3015-3043 (Mourad Debbabi)
1 https://www.darkreading.com/endpoint/iot-specific-malware-infections-jumped-700-amid-pandemic
A range of techniques has been proposed
to select the most relevant features. The most common
approach is the use of ensemble methods (e.g., Sagi and
Rokach (2018); Sheikhpour et al. (2017); Lazar et al. (2012)),
because these methods make the variables easier to explain
than other techniques do. However,
it is sometimes difficult to know which features
the model weights more heavily than others (Gomes
et al. (2017)). In addition, ensemble learning techniques
combine multiple models in order to improve the
overall predictive capability and to reduce overfitting
as much as possible.
Several state-of-the-art techniques have been proposed
to deal with the non-deterministic aspect of the feature selection
problem. These works generally use optimization algorithms
to find optimal solutions and make decisions according
to a certain objective function. For instance, Ahmad
et al. (2018) propose a feature selection approach using
artificial bee colony (ABC), and Dong et al. (2018) incorporate
a hybrid genetic algorithm with granular information.
However, these approaches have not been evaluated
on other types of datasets (e.g., intrusion detection
system (IDS) datasets).
On the other hand, autoencoders (Liu et al. (2017); Jia
and Zhang (2018)) are a type of neural network that
aims to reconstruct a given input as an output with the
least possible changes. Autoencoders are widely used for
anomaly detection (Chalapathy and Chawla (2019); Ahmed
et al. (2016); Kwon et al. (2019); Xie et al. (2011)) due to their
ability to better represent (compress) the data to a latent-
Aniss Chohra et al.: Preprint submitted to Elsevier Page 1 of 19
Optimized Feature Selection for Network Anomaly Detection
representation (bottleneck), which consists of a reduced rep-
resentation of the input. In addition, their ability to recon-
struct the input, makes them more suitable to the anomaly
detection task: anomalies can be detected by comparing
the reconstructed output with the input and flagging the input as
anomalous when the reconstruction deviates noticeably from it.
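The reconstruction-based detection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the model used in this paper: `toy_reconstruct` is a hypothetical stand-in for a trained autoencoder, and the threshold rule (mean plus `k` standard deviations of the errors on normal data) is one common choice that the text does not prescribe.

```python
from statistics import mean, pstdev


def reconstruction_error(original, reconstructed):
    """Mean squared error between an input vector and its reconstruction."""
    return mean((x - y) ** 2 for x, y in zip(original, reconstructed))


def fit_threshold(train_errors, k=3.0):
    """Assumed rule: mean + k * std of the errors observed on normal data."""
    return mean(train_errors) + k * pstdev(train_errors)


def flag_anomalies(samples, reconstruct, threshold):
    """Return True for every sample whose reconstruction error exceeds the threshold."""
    return [reconstruction_error(s, reconstruct(s)) > threshold for s in samples]


def toy_reconstruct(x):
    """Hypothetical stand-in for a trained autoencoder.

    It reproduces values inside [0, 1] almost exactly and cannot reproduce
    values outside the range it is assumed to have seen during training.
    """
    return [min(1.0, max(0.0, 0.5 + 0.9 * (v - 0.5))) for v in x]


# Errors on (assumed) normal traffic define the threshold.
normal_traffic = [[0.2, 0.4], [0.5, 0.1], [0.9, 0.7]]
train_errors = [reconstruction_error(s, toy_reconstruct(s)) for s in normal_traffic]
threshold = fit_threshold(train_errors)

# The second sample lies far outside the normal range and is flagged.
flags = flag_anomalies([[0.3, 0.3], [5.0, 0.2]], toy_reconstruct, threshold)
```

In practice, `reconstruct` would be the forward pass of the trained autoencoder, and the threshold would be tuned on validation data.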
Moreover, several works have been proposed
to detect anomalies using deep learning models, e.g., Du
et al. (2017); Merrill and Eskandarian (2020); Hwang et al.
(2020); Dutta et al. (2020); Xiong et al. (2020); Doan and
Zhang (2020); Chalapathy et al. (2020). However, to the best
of our knowledge, none of them explores the effects of fea-
ture selection before applying the anomaly detection model.
Problem Statement: Using all the features present in the
input data to train autoencoders can be quite troublesome
and time-consuming, especially when the input data contains
millions of records, which complicates experimentation and
model engineering. Moreover,
autoencoders focus mainly on feature engineering and
extraction rather than feature selection. In other words, by
transforming the input data into a compressed representation,
autoencoders reduce the dimensionality of
the data and learn a smaller representation. With a
large number of features, explaining and understanding the
compressed data is difficult, whereas feature selection identifies
the most useful and relevant features that best describe
a given ground truth variable (Hartmann (2004)).
Generally, during the feature selection process, choos-
ing the appropriate set of hyper-parameters is quite chal-
lenging. This is due to the fact that it is a non-deterministic
problem, which can have multiple optimal solutions; all of
them would give the same performance and accuracy results.
Thus, even after exhaustive experimentation, there is no ev-
idence to prove that (i) all the possible solutions have been
explored, and (ii) a particular solution is the best one.
Key Idea: In this context, we propose a novel approach,
called CHAMELEON, which combines swarm intelligence
and ensemble learning techniques to select the optimal
settings (hyper-parameters for the ensemble models
as well as the most relevant features for each
individual dataset) for the feature selection task. The proposed
approach uses ensemble learning classifiers as a fitness and
evaluation function for each individual/particle within the
population/swarm. This population aims to converge to the
optimal solutions in an iterative process, where in each it-
eration, each individual tries to move closer to the optimal
solutions. Afterwards, we use the selected features given by
the optimal ensemble model to construct an anomaly detec-
tion autoencoder; we iteratively improve the model until it
outperforms the state-of-the-art models.
Contributions: The main contributions of our work are
summarized as follows:
• Novel feature selection for network datasets: We propose
a feature selection approach for network datasets
that leverages both swarm intelligence and ensemble
methods to select the most relevant features. The ensemble
methods are used as a fitness function within
the optimization approach in order to leverage their
ability to better interpret and select the independent
features.
• Training time improvement: We employ the selected
features obtained from the optimization step and deploy
deep learning models for network anomaly detection.
The feature selection process considerably improves
the training time compared to the case where
all features are used.
• Malicious and benign dataset generation: We set up
an environment and generate a dataset, called the IoT-Zeek
dataset, from PCAPs and connection logs using Zeek
NIDS Team (2018). Then, we introduce an ensemble
model leveraging classical machine learning and
deep learning classifiers in order to learn malicious
behaviour on the generated network traffic and classify
network logs into malicious or benign connections.
• Evaluation: We evaluate our proposed approach on
different datasets (i.e., IoT dataset: IoT-Zeek, and
non-IoT datasets: NSL-KDD and UNSW-NB15) and
demonstrate its efficiency and performance. In addition,
experiments performed on the selected features
obtained from the optimal solution for each dataset
indicate that our proposed model outperforms existing
works.
This paper is organized as follows. The most rele-
vant state-of-the-art works are discussed in section 2. An
overview of the proposed approach along with the method-
ologies are presented in section 3. The evaluation results are
provided in section 4. The limitations of our approach along
with the concluding remarks are presented in section 5.
2. Related Work
In this section, we present the most relevant and impor-
tant works that have been proposed for: (i) feature selec-
tion using optimization algorithms and (ii) anomaly detec-
tion and maliciousness fingerprinting using machine learn-
ing and deep learning models.
2.1. Feature Selection Using Optimization
Ahmad et al. (2018) propose a feature selection approach
using Artificial Bee Colony (ABC). In addition, they integrate
a Kalman filter² alongside the Hadoop ecosystem³ for
noise removal. The system is validated on ten datasets and
compared with swarm intelligence approaches. However,
the authors have not applied their approach to IDS datasets.
Dong et al. (2018) propose a technique for feature se-
lection, which incorporates a hybrid genetic algorithm with
2 http://web.mit.edu/kirtley/kirtley/binlustuff/literature/control/Kalman%20filter.pdf
3 https://hadoop.apache.org/
granular information. This technique is tested on 11 bench-
mark financial datasets and has been compared with cer-
tain state-of-the-art techniques. The obtained results demon-
strate that it achieves high classification accuracy. However,
their work does not explore the use of the approach on
other types of datasets (e.g., IDS and network datasets).
In Xue et al. (2012), a novel feature selection approach is
proposed for classification, where the feature selection task
is considered as a non-deterministic problem. The authors
investigated two multi-objective particle swarm optimization
(PSO) algorithms. The first one applies the concept
of non-dominated sorting to the feature selection problem,
whilst the second one introduces additional evolutionary
concepts (mutation and crossover) to search for better
solutions. These two algorithms were compared
with two standard feature selection techniques and
validated on twelve benchmark datasets. However, the authors did not
explore the use of more complex fitness functions.
A novel approach for feature selection is proposed in Liu
et al. (2011), which combines multi-swarm particle swarm
optimization (MSPSO) with support vector machines (SVM)
as the fitness function and the 𝑓1-score as the fitness value. The
goal was to perform kernel optimization and feature selection
simultaneously in order to obtain better generalization.
The proposed approach was compared with state-of-the-art
feature selection algorithms using PSO, genetic algorithm
(GA), and grid search, using ten UCI (UC Irvine)⁴ machine
learning benchmark datasets for validation. The evaluation
results show that their technique outperforms the
three aforementioned techniques. However, the proposed
algorithm is specific to the datasets used for validation
and has not been tested on network IDS datasets.
In Ghamisi and Benediktsson (2014), the authors propose
a feature selection approach that combines the
GA and PSO algorithms, where SVMs are used as the fitness
function and accuracy as the fitness value. The
proposed technique was validated on the Indian Pines spectral
dataset (NASA AVIRIS Sensor (2021)), and the results show
that the approach can select the most relevant features,
which lead to higher classification accuracy. However,
the authors presented neither an exhaustive study on benchmark
datasets nor a comparative study with state-of-the-art
techniques. Moreover, the proposed solution is limited
to the utilized dataset.
A novel approach for feature selection that combines a
genetic algorithm with neural networks (HGA-NN) is introduced
in Oreski and Oreski (2014). The approach was applied
to a real-world credit dataset collected from a Croatian
bank, and further evaluated on a benchmark credit
dataset selected from the UCI database. Finally, this technique
was compared with existing classification works in terms of
accuracy and was shown to outperform them. However,
we find that this technique focuses on accuracy
rather than 𝑓1-score, and has only been applied to UCI
datasets.
4 https://archive.ics.uci.edu/ml/datasets.php
2.2. Deep Learning and Anomaly Detection
In Tama et al. (2020), the authors present a novel
anomaly detection system for web applications based on
a stacked ensemble that combines other ensemble models
(e.g., random forests, XGBoost). Four datasets (CSIC-2010v2,
CICIDS-2017, NSL-KDD, UNSW-NB15) were
used to validate their approach. The obtained results
show that the proposed stacked model outperforms existing
web attack detection solutions in terms of accuracy and
false positive rate (FPR). However, the authors have
not performed a scalability and complexity study of their
approach, especially for the two large datasets (UNSW-NB15
and CICIDS-2017). Nkenyereye et al. (2021) also proposed
a stacking-based model for anomaly-based intrusion detection
systems, where the base learners are
deep neural networks (DNN). Their approach is
validated on benchmark datasets (NSL-KDD, UNSW-NB15,
and CICIDS-2017) and evaluated using several metrics, including
accuracy, false positive rate, and the Matthews
Correlation Coefficient. The obtained results show that
their model outperforms a simple DNN-based anomaly model
as well as some state-of-the-art techniques (achieving
89.97%, 92.83%, and 99.65% on the three aforementioned
benchmark datasets, respectively). However, they
have not performed a scalability study of their model on
these datasets.
In Hamamoto et al. (2018), the authors present a novel
approach for anomaly detection that combines a genetic
algorithm with fuzzy logic. More specifically, the genetic
algorithm is deployed to better represent fingerprints
of network segments using network flow data, which
also makes it possible to predict network traffic behaviour for
specific, pre-defined time windows. Then, fuzzy logic is used to
decide whether there is anomalous behaviour within
those time windows. Their approach was validated and evaluated
on real-world network traffic data and proved to
be effective, achieving 96.53% accuracy and a 0.56%
false positive rate.
Ma et al. (2021) proposed a novel approach for anomaly
detection on network traffic data, called SVM-L, which combines
SVM and Linear Discriminant Analysis (LDA).
More specifically, URLs from the data are used as input and
converted into vectors using natural language processing
(NLP) and statistical techniques. Then, these vectors are
fed to the SVM model, which classifies them as anomalous
or normal. In addition, the authors used an optimization
algorithm to tune the hyper-parameters of the
SVM classifier. The validation results of the SVM-L model
show that it achieves 99% accuracy on the tested datasets.
There exist several solutions (e.g., Alsaheel et al. (2021),
Shen and Stringhini (2019)) that improve the results of the
maliciousness segregation using advanced machine learning
and NLP techniques on log files. For instance, Shen and
Stringhini (2019) propose attack2vec to detect emerging net-
work attacks by leveraging dynamic word embeddings tech-
niques. Similar to NLP word embeddings, their approach
produces a dense representation of the security events while
considering the time factor. Moreover, in Alsaheel et al.
(2021), the authors propose Atlas, a framework for attack in-
vestigation that leverages NLP and deep learning techniques
to segregate attacks and non-attacks using logs as input. At-
las begins with processing the logs and building a causal de-
pendency graph between the events found in the logs. This
graph is augmented using NLP techniques and used to train
a sequence-based model that represents the attack semantics.
The produced models help the cyber analyst identify key attack
steps that share similarities with previous patterns. In
contrast, our proposed real-world IoT dataset generation
(presented in subsection 3.4) fingerprints malicious logs
in the IoT network traffic data by leveraging an ensemble
model constructed from several classifiers (e.g.,
Random Forests, XGBoost, CatBoost, NN, and CNN).
In Roy and Singh (2021), the authors present a study of
different anomaly detection classifiers before and after ap-
plying feature selection. More specifically, the authors com-
pare different machine learning classifiers by training each
model twice. In the first iteration, they use all the exist-
ing features from the dataset. During the second iteration,
they first tune the classifier with several feature selection al-
gorithms; then they select the feature selection algorithm
which gives the best accuracy results, and use the selected
features with the same classifier as for the first training it-
eration. The reported results show that the J84 classifier
achieves the highest accuracy.
The authors of Mahalakshmi et al. (2021) applied a convolutional
neural network (CNN) model to detect anomalies
efficiently. The obtained results show that their proposed CNN
model achieves an accuracy of 93.5%. However, the authors
have not compared their work with any other state-of-the-art
approaches.
In Min et al. (2021), the authors introduced a novel
network anomaly detection technique, called memory-augmented
deep auto-encoder (MemAE). The autoencoder is
used to reconstruct the behavior of abnormal samples that
look close to normal ones; in this way, the authors address the
problem of over-generalization, which occurs with abnormal
samples on autoencoders.
Roy et al. (2022) propose a lightweight intrusion detec-
tion system, called B-Stacking, based on supervised ma-
chine learning. A series of feature transformations, dimen-
sionality reduction and feature selection methods are applied
to produce the learning features. Afterwards, the authors
propose B-Stacking, a machine learning ensemble that uses
K-Nearest Neighbors (KNN), Random Forest, and XGBoost
to detect network anomalies. The system is claimed to be
lightweight and to target IoT devices; however, the experiments
have been carried out on a notebook with an Intel Core i5-9400F
CPU (2.90 GHz) and 8 GB of RAM, where the system consumes
3.4% of the RAM and 1.5% to 2.9% of the CPU. Such consumption
on a high-end machine is considered highly demanding
for an IoT device. In addition, the detection run-time has
not been reported.
The authors in Rashid et al. (2022) propose a stacking ensemble
technique (SET) with the SelectKBest feature selection
technique for network anomaly detection. First, dimensionality
reduction and feature selection are applied to identify
the relevant features. Next, an ensemble of Decision Trees,
Random Forest, and XGBoost machine learning models is
employed to detect anomalies. However, the SelectKBest
technique is less adaptive to new malicious network
traffic over time.
3. Materials and Methods
In this section, we first provide background on the re-
lated topics, then we present an overview of our approach.
Next, we provide details on the proposed methodologies for
feature selection and anomaly detection. Finally, we present
our approach to generate the IoT-Zeek dataset.
3.1. Background
In the following, we provide an overview of particle
swarm optimization (PSO) and ensemble methods.
3.1.1. Particle Swarm Optimization
Particle swarm optimization (PSO) (Kennedy and Eberhart
(1995)) is a stochastic, meta-heuristic optimization
algorithm inspired by the social behaviour
of some animals (e.g., birds and fish). In the PSO algorithm,
the population of individuals is referred to as the swarm,
and each individual within the swarm is referred to as a particle.
These particles try to find the set of optimal solutions
to a given problem by constantly updating their positions
according to their own performance, which is called the
cognitive aspect, and the current overall performance of the
swarm, which is called the social aspect. Thus, PSO is based on two
essential principles: cooperation/collaboration and competition,
where the former represents the ability of one particle to
communicate with other particles in order to coordinate
their efforts towards the optimal solutions, whilst the latter
represents a particle's desire to use its own performance
to move towards a possible solution.
Moreover, each particle is defined within a search space,
which represents the set of hyper-parameters to be optimized
for the solution. Depending on the swarm’s global solution,
each particle computes the cognitive aspect and the social
aspect according to Equation 1 and Equation 2, respectively,
as follows:
$$\mathit{cognitive} = c_1 \times r_1 \times (\mathit{pos\_best}_i - \mathit{pos}_i) \tag{1}$$
$$\mathit{social} = c_2 \times r_2 \times (\mathit{pos\_global} - \mathit{pos}_i) \tag{2}$$
where $c_1$ and $c_2$ are called acceleration constants and define
the speed at which a particle should move towards the
optimal solutions ($c_1$ defines the speed at which the particle
should converge to its local solution, whilst $c_2$ defines
the speed of convergence of the whole swarm towards the
global solution), $r_1$ and $r_2$ are two randomly generated values
that control the stochastic influence of the cognitive and
social components on the overall velocity of a particle, $\mathit{pos}_i$
represents the position of a particle at iteration $i$, $\mathit{pos\_best}_i$
represents the local optimal solution found by that particle
so far, and $\mathit{pos\_global}$ represents the position of the global
solution found by the entire swarm so far.
Afterwards, each particle updates its velocity using
Equation 3, where $t$ represents the current particle, $v_i(t)$
represents the velocity of the current particle at iteration $i$, and
$w$ is the inertia weight (importance) given to that velocity
(the smaller the weight $w$, the stronger the convergence
towards the global optimum). Finally, the position of each
particle is updated using Equation 4.
$$v_i(t+1) = w \times v_i(t) + \mathit{social} + \mathit{cognitive} \tag{3}$$
$$p_i(t+1) = p_i(t) + v_i(t) \tag{4}$$
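Equations 1 to 4 can be sketched as a single update of one particle. This is a minimal illustration, not the authors' implementation: the function name `pso_step` and its signature are hypothetical, the default constants are taken from the initialization values listed later in Algorithm 2, and the position update applies the freshly computed velocity, as in the standard PSO formulation, which is an assumption about how Equation 4 is meant to be read.

```python
import random


def pso_step(position, velocity, personal_best, global_best,
             c1=2.0, c2=2.0, w=0.5, rng=random):
    """Apply one PSO update to a single particle (lists of floats).

    rng is any object with a random() method returning a float in [0, 1).
    """
    # One random factor per component (cognitive, social), as in Equations 1-2.
    r1, r2 = rng.random(), rng.random()
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, personal_best, global_best):
        cognitive = c1 * r1 * (pb - x)          # Equation 1
        social = c2 * r2 * (gb - x)             # Equation 2
        v_next = w * v + social + cognitive     # Equation 3
        new_velocity.append(v_next)
        # Equation 4; assumption: the updated velocity is used here.
        new_position.append(x + v_next)
    return new_position, new_velocity


# A particle that already sits on both best positions with zero velocity stays put.
resting_position, resting_velocity = pso_step([1.0, 2.0], [0.0, 0.0],
                                              [1.0, 2.0], [1.0, 2.0])
```

A smaller `w` damps the previous velocity, so the pull towards the personal and global best positions dominates the move.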
3.1.2. Ensemble Methods
Ensemble methods are a type of machine learning model
that combines multiple classifiers/predictors
in order to improve the overall
classification/prediction performance. In other words, ensemble
methods combine the decisions made by multiple models using
techniques such as majority voting, averaging, and weighted
averaging. Moreover, these techniques provide easier interpretation
of features and better predictive performance with less
overfitting compared to other machine learning techniques.
This family of machine learning techniques is generally classified
into two major types (Bühlmann (2012)), which are presented
in Figure 1, as follows:
1. Bagging, where all the predictors run
in parallel and independently, and are then
combined using an aggregation technique to make the
final decision. An example of this type is random
forests, where a random sample, called a bootstrap, is
drawn for and fed to each model. Therefore, each
model in the forest sees different observations,
which reduces the correlation between the predictors
and makes them less prone to overfitting.
2. Boosting, where each predictor learns from its
predecessor's errors (called residuals).
Therefore, this type of ensemble method is executed
in sequential order, which gives it the advantage
of shorter training times over the first type.
An example of this type is
gradient boosting.
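The three aggregation techniques mentioned above (majority voting, averaging, and weighted averaging) can be illustrated with a short sketch. The function names are hypothetical and the example inputs are made up for illustration; they are not outputs of the models used in this paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most models (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]


def average(scores):
    """Plain average of the scores produced by the individual models."""
    return sum(scores) / len(scores)


def weighted_average(scores, weights):
    """Average in which each model's score counts in proportion to its weight."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Three hypothetical models vote on one connection.
label = majority_vote(["malicious", "benign", "malicious"])

# The same three models output a maliciousness score instead of a label.
mean_score = average([0.2, 0.4, 0.9])
trusted_score = weighted_average([0.2, 0.4, 0.9], [1, 1, 2])
```

Weighted averaging lets a more reliable model (here the third one, with an assumed weight of 2) pull the final score towards its own prediction.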
3.2. Approach Overview
In order to identify anomalous connections, we propose
a deep learning-based autoencoder anomaly detection model. The
input to this model is a set of features obtained from the network
traffic connection logs. We propose a hybrid model
consisting of the PSO algorithm and ensemble methods to identify
the most relevant set of features for any given dataset.
During this process, we explore two types of fitness functions
(models): the first belongs to the bagging family of
ensemble methods (random forests), and the second
belongs to the boosting family (gradient boosting).
Figure 1: Bagging vs. Boosting Ensemble Methods. [Diagram: two panels, Bagging and Boosting, each showing Bootstraps 1 to 3 and Predictors 1 to 3.]
Figure 2: Approach Overview. [Diagram: Input Dataset; PSO-based Optimized Feature Selection (1. Search Space Definition, 2. Fitness and Objective Functions Definition, 3. Algorithm Initialization, 4. Iterative Search, 5. Optimal Solutions Selection); Selected Features; 6. Deep Learning Anomaly Detection.]
An overview of our approach is presented in Figure 2.
Feature selection can be viewed as five sequential steps: fitness
and objective function definition, search-space definition,
algorithm initialization, iterative search, and optimal
solutions selection. The proposed approach starts by defining
the search space (Step 1) for PSO (Kennedy and Eberhart
(1995); Ali and Malebary (2020)) depending on the
chosen fitness function (Step 2). Afterwards, it takes as input
any labeled dataset and initializes a fixed-size population/swarm
by generating random particles (Step 3). Given
a fixed number of iterations, each particle then tries
to find the optimal position by constantly updating
its position within the search space according
to the performance of the swarm and its own performance
(Step 4). The goal of the swarm is to find the optimal
model (optimal hyper-parameters) that maximizes
certain performance metrics (Step 5). Finally, we consider
only the set of best-fitting settings (e.g., hyper-parameters),
which give us the highest accuracy metrics. We then use
these settings to build the final model(s) in order to extract
the most relevant features. Afterwards, we leverage the selected
features discovered during the optimization part and
engineer an anomaly detection model using deep learning
autoencoders (Lauzon (2012)) (Step 6). Our goal is to obtain an
anomaly detection model that outperforms existing models
in terms of the 𝑓1-score metric.
3.3. Methodology
In this section, we provide more details on the proposed
methodology.
3.3.1. Optimized Feature Selection
Our feature selection algorithm can be performed as five
sequential steps. The algorithmic description of the opti-
mized feature selection is presented in Algorithm 1 and Al-
gorithm 2. In what follows, we explain each step in detail.
Algorithm 1 Feature Selection: Algorithmic Description
Input: D ⊳ input dataset
Output: optimal_solutions
global variables
    c1, c2 ⊳ cognitive and social acceleration constants, respectively
    r1, r2 ⊳ cognitive and social random factors, respectively
    w ⊳ velocity's weight
    swarm_size
    maxIterations
    global_fitness ⊳ global solution's fitness value
    global_position ⊳ global solution's position
    global_solutions_list ⊳ optimal solutions' positions and fitness values
end global variables
fitness_function ← DEFINE_FITNESS
bounds ← DEF_SEARCH_SPACE(fitness_function)
(swarm, swarm_size, maxIterations, c1, c2, r1, r2, w) ← ALGORITHM_INIT
for each i ∈ maxIterations do ⊳ iterative search
    for each particle_k ∈ swarm do
        particle_k_fitness ← EVALUATE_FITNESS(fitness_function, particle_k)
        if particle_k_fitness > global_fitness then
            global_fitness ← particle_k_fitness
            global_position ← particle_k_position
            global_solutions_list += global_position
            personal_best_position ← particle_k_position
        end if
        UPDATE_VELOCITY(c1, c2, r1, r2, w, global_position, particle_k_position, personal_best_position, particle_k_velocity)
        UPDATE_POSITION(particle_k_velocity, particle_k_position)
    end for
end for
optimal_solutions ← max(f1_score, global_solutions_list) ⊳ fitness and objective function definition, optimal solutions selection
return optimal_solutions
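The loop structure of Algorithm 1 can be sketched as a small, self-contained search. This is an illustration under stated assumptions rather than the authors' implementation: `pso_search` and `toy_fitness` are hypothetical names, the toy fitness replaces the ensemble classifier's 𝑓1-score with a simple function whose optimum is known, personal bests are tracked per particle as in standard PSO, and positions are clipped to the search bounds. The bounds, swarm size, iteration count, and constants mirror the values listed in Algorithm 2.

```python
import random


def pso_search(fitness, bounds, swarm_size=15, max_iterations=30,
               c1=2.0, c2=2.0, w=0.5, seed=0):
    """Maximize `fitness` over the box `bounds`; returns (position, fitness, history)."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(swarm_size)]
    velocities = [[0.0] * dim for _ in range(swarm_size)]
    personal_best = [p[:] for p in positions]
    personal_fit = [fitness(p) for p in positions]
    best = max(range(swarm_size), key=personal_fit.__getitem__)
    global_best, global_fit = personal_best[best][:], personal_fit[best]
    history = [global_fit]  # best fitness after initialization and each iteration

    for _ in range(max_iterations):
        for k in range(swarm_size):
            r1, r2 = rng.random(), rng.random()
            for d, (lo, hi) in enumerate(bounds):
                cognitive = c1 * r1 * (personal_best[k][d] - positions[k][d])
                social = c2 * r2 * (global_best[d] - positions[k][d])
                velocities[k][d] = w * velocities[k][d] + social + cognitive
                # Assumption: keep the particle inside the search space.
                positions[k][d] = min(hi, max(lo, positions[k][d] + velocities[k][d]))
            fit = fitness(positions[k])
            if fit > personal_fit[k]:
                personal_best[k], personal_fit[k] = positions[k][:], fit
            if fit > global_fit:
                global_best, global_fit = positions[k][:], fit
        history.append(global_fit)
    return global_best, global_fit, history


def toy_fitness(p):
    """Hypothetical stand-in for the classifier's f1-score; peak at (0.25, 400)."""
    return -((p[0] - 0.25) ** 2 + ((p[1] - 400.0) / 1000.0) ** 2)


rf_bounds = [(0.1, 0.4), (50, 1000)]  # random-forest bounds from Algorithm 2
best_position, best_fitness, history = pso_search(toy_fitness, rf_bounds)
```

In the actual approach, each fitness evaluation trains an ensemble classifier with the hyper-parameters encoded in the particle, so the cost of the search is dominated by `swarm_size × max_iterations` model trainings.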
Fitness and objective functions: First, we define the fitness
function to be used to evaluate the performance of each par-
ticle within the swarm. We choose ensemble methods clas-
sifiers due to their advantages and benefits, which include
low overfitting and high accuracy. Each particle is fed to
the classifier which in turn will automatically adapt to it and
will be trained on the dataset accordingly. At the end of this
process, the fitness function returns the following evaluation
metrics: accuracy, recall, precision, and 𝑓1score.
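The four evaluation metrics returned by the fitness function can be computed from the confusion counts as sketched below. The function name `classification_metrics` is hypothetical, the labels are made-up illustration data, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and f1 for a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 1 = malicious, 0 = benign: two true positives, one miss, one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

On this example the accuracy (0.75) is higher than the 𝑓1-score (about 0.67), because the many correctly classified benign samples inflate accuracy while 𝑓1 only reflects performance on the malicious class.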
Next, we define the objective function that the set of candidate optimal solutions must satisfy. The objective function filters the results and keeps only those that meet our needs. In this work, since we use ensemble models as evaluation/fitness functions, we should select the metric that best describes the performance of these models for each particle. From the above-mentioned evaluation metrics, we choose the last one (𝑓1 score), since it is the harmonic mean of precision and recall and therefore takes both false positives and false negatives into account, whereas accuracy only measures the fraction of correct predictions (true positives and true negatives). Moreover, the 𝑓1 metric is better suited to unbalanced datasets, where the target classes are not evenly represented. Our objective is to retain only the model settings that give the highest 𝑓1 score values; thus, we define the objective function to be the maximization of these values. This process also helps our algorithm avoid settling on a local optimum and converge toward a global one.
Algorithm 2 Feature selection: Functions Definitions
1: procedure DEFINE_FITNESS ⊳ Fitness and objective function definition
2:    fitness_function ← ensemble_classifier ⊳ tuned between random forests and gradient boosting
3:    return fitness_function
4: end procedure
5: procedure DEF_SEARCH_SPACE(fitness_function) ⊳ Search-space definition
6:    if fitness_function == random_forest then
7:       return bounds ← [(0.1, 0.4), (50, 1000)]
8:    else if fitness_function == gradient_boosting then
9:       return bounds ← [(0.1, 0.4), (50, 1000), (0.1, 0.3)]
10:   end if
11: end procedure
12: procedure ALGORITHM_INIT ⊳ Algorithm initialization
13:    c1 ← [1, 2] ⊳ c1 is tuned using two different values: 1 and 2
14:    c2 ← 2
15:    w ← 0.5
16:    r1 ← random, r2 ← random
17:    swarm_size ← 15
18:    swarm ← random(particles, swarm_size)
19:    maxIterations ← 30
20:    return swarm, swarm_size, maxIterations, c1, c2, r1, r2, w
21: end procedure
22: procedure EVALUATE_FITNESS(fitness_function, particle)
23:    fitness_value ← f1_score(fitness_function, particle)
24:    return fitness_value
25: end procedure
26: procedure UPDATE_VELOCITY(c1, c2, r1, r2, w, global_position, particle_k_position, personal_best_position, particle_k_velocity)
27:    cognitive = c1 * r1 * (personal_best_position − particle_k_position)
28:    social = c2 * r2 * (global_position − particle_k_position)
29:    particle_k+1_velocity = w * particle_k_velocity + social + cognitive
30:    return particle_k+1_velocity
31: end procedure
32: procedure UPDATE_POSITION(particle_k_velocity, particle_k_position)
33:    update particle position:
34:    particle_k+1_position = particle_k_velocity + particle_k_position
35:    return particle_k+1_position
36: end procedure
Search space: Since we use ensemble methods as the particles' evaluation/fitness function, we should choose the appropriate set of hyper-parameters to be fed to the models. Depending on the type of the model, i.e., bagging vs. boosting Bühlmann (2012), we select the hyper-parameters that matter most to the learning process of the learner/model (e.g., the number of trees/estimators and the respective sizes of the training and testing splits). This set of hyper-parameters defines the dimensions of the space that our algorithm searches for possible optimal solutions.
For bagging techniques, two hyper-parameters need to be investigated and optimized, namely, the test data size and the number of estimators (trees). The first one defines the share of the data on which the model is tested; the size of the training data follows from it automatically. It is generally recommended to set the testing set smaller than the training set (between 10% and 40%); thus, we set the lower bound to 10% and the upper bound to 40%. This hyper-parameter is the first dimension of each particle; based on it, the fitness function decides how to split the input dataset and train the appropriate ensemble model.
The second hyper-parameter that needs to be optimized is the number of estimators, i.e., the number of decision trees that make up the ensemble learning model. Normally, the larger the number of trees, the better the overall ensemble model performs. However, this has a limit: at some point the improvement stops, and performance can start to degrade, resulting in badly predicted samples and even overfitting. In addition, the larger the number of trees, the higher the computational cost, which makes experimentation more difficult on large-scale datasets. In general practice, this hyper-parameter is chosen through repeated experimentation: the number of trees is initialized with the smallest value and slightly increased at each iteration to improve the model's performance over the previous results. However, finding the optimal number of estimators in this way is very time consuming, especially for large datasets, where it can lead to days or even months of experimentation. Moreover, this approach does not exhaustively explore all the possible optimal values for the hyper-parameters; it relies mainly on the knowledge and experience of the experts. Therefore, we propose to integrate this parameter into the optimization algorithm as a hyper-parameter to be optimized for the global solution. To keep the optimization algorithm scalable, we restrict this hyper-parameter to values between 50 (lower bound) and 1000 (upper bound).
In boosting methods, new trees are added to the model in order to correct the samples mispredicted (residuals) by the previous tree. This process has two effects. The first, which can be considered a benefit, is faster training compared to bagging techniques. The second, which can be considered a disadvantage, is that the model becomes more prone to overfitting. To overcome this problem, a learning rate is introduced; it can be seen as a weight (percentage) that controls and reduces the amount of correction applied by the current tree (e.g., predictor/classifier) to the previous one. As a result, the overall performance of the model improves when the learning rate is smaller and the number of trees is higher. Therefore, in addition to the above-mentioned hyper-parameters, boosting ensemble methods require a third essential hyper-parameter, the learning rate, to be optimized. In general practice, it is recommended to set this parameter to a value between 0.1 (lower bound) and 0.3 (upper bound).
Thus, our optimization algorithm’s search space for the
bagging methods is defined as a 2-dimensional space, where
the first dimension represents the test size, and the second
one is the number of decision trees/estimators included in the
ensemble model. For the boosting methods, our search space
is defined as a 3-dimensional space, where each particle has
three parameters: (test size, number of trees, learning rate).
Algorithm initialization: We start by initializing the settings of our PSO algorithm. First, we define the maximum number of iterations, which can be viewed as the number of chances given to the swarm to find the optimal solutions. This parameter is essential: in optimization problems, we only know that the problem to be solved might have multiple optimal solutions, but we do not know their exact number. If the number of chances is too large, the algorithm can take a tremendous amount of time. On the other hand, the ability of the optimization algorithm to find more optimal solutions improves as the number of iterations increases. To limit the time consumption, we fix this setting to 30 iterations.
Moreover, we need to define the values of the acceleration constants (𝑐1 and 𝑐2) and the weight (𝑤) Kennedy and Eberhart (1995). For the former, it is recommended to set them such that their product is between 0 and 4 (0 ≤ 𝑐1 × 𝑐2 ≤ 4) Marini and Walczak (2015). We run the algorithm twice: in the first run, we set both constants to 2, whilst in the second run, we give more importance (speed) to the global solution by setting 𝑐2 to 2 and 𝑐1 to 1. The intuition is to start by giving the same importance to the local and global solutions, then increase the relative importance of the global solution (𝑐2) and check which setting allows us to explore better optimal solutions.
Next, we initialize the swarm (population of particles) by first defining a fixed number of particles (individuals), swarm_size, that make up the swarm. For each particle, we randomly generate its velocity and position such that they lie within our pre-defined bounds (search space definition). We initialize the global fitness value (𝑓1) of the overall swarm to 0.5.
Iterative search: During the iterative process, each particle in the swarm is evaluated by the fitness function (ensemble model classifier) using its own coordinates. If the 𝑓1 score of the particle is found to be greater than the global (swarm's) 𝑓1 score, the particle first updates the global 𝑓1 score to its own value and sets the global solution's position to its own position. Next, the particle updates its velocity and position using the corresponding functions presented in Algorithm 2 (line 29 and line 34, respectively).
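The loop described above can be illustrated with a minimal, self-contained Python sketch. It is not the prototype's code: the function names are hypothetical, the fitness function is a toy stand-in for the classifier's 𝑓1 score, velocities are assumed to start at zero, the global fitness starts at minus infinity because the toy fitness is negative, and each particle keeps its own personal best as in the standard global-best PSO formulation.

```python
import random


def pso_maximize(fitness, bounds, swarm_size=15, max_iterations=30,
                 c1=2.0, c2=2.0, w=0.5, seed=0):
    """Maximize `fitness` over the box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dims = range(len(bounds))
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(swarm_size)]
    velocities = [[0.0 for _ in dims] for _ in range(swarm_size)]  # assumption: start at rest
    personal_best = [list(p) for p in positions]
    personal_fit = [float("-inf")] * swarm_size
    global_best, global_fit = list(positions[0]), float("-inf")

    for _ in range(max_iterations):
        for k in range(swarm_size):
            fit = fitness(positions[k])
            if fit > personal_fit[k]:
                personal_fit[k], personal_best[k] = fit, list(positions[k])
            if fit > global_fit:
                global_fit, global_best = fit, list(positions[k])
            for d in dims:
                lo, hi = bounds[d]
                r1, r2 = rng.random(), rng.random()
                cognitive = c1 * r1 * (personal_best[k][d] - positions[k][d])
                social = c2 * r2 * (global_best[d] - positions[k][d])
                velocities[k][d] = w * velocities[k][d] + social + cognitive
                # Move, then clamp to the closest bound if the particle left the box.
                positions[k][d] = min(hi, max(lo, positions[k][d] + velocities[k][d]))
    return global_best, global_fit


def toy_fitness(position):
    """Stand-in for the classifier's f1 score; peaks at (0.2, 300)."""
    test_size, n_trees = position
    return -((test_size - 0.2) ** 2) - ((n_trees - 300) / 1000) ** 2


best_position, best_fitness = pso_maximize(toy_fitness, [(0.1, 0.4), (50, 1000)])
```

In the actual setting, `toy_fitness` would be replaced by a function that trains the ensemble classifier with the particle's hyper-parameters and returns its 𝑓1 score, which is by far the most expensive step of each iteration.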
As for the time complexity of this process, it requires on the order of 𝑂(𝑛𝑚) fitness evaluations, where 𝑛 represents the maximum number of iterations (line 14 in Algorithm 1) and 𝑚 represents the population/swarm size (line 15 in Algorithm 1). Since 𝑛 and 𝑚 have small values in our experiments (the maximum number of iterations is 30 and the swarm size is 15), our approach does not encounter a high time complexity issue. As for the fitness function (line 16 in Algorithm 1), the (training) time complexity of the models (i.e., Random Forests or XGBoost) is of the order of 𝑂(𝑘 ⋅ 𝑣 ⋅ 𝑟 log 𝑟), where 𝑘 is the number of trees, 𝑣 is the number of features, and 𝑟 is the number of records/rows. Since evaluating the fitness of each particle (by training either Random Forests or XGBoost models) is the bottleneck of our algorithm, we leverage multiprocessing and take advantage of 20 CPU cores of our setup environment. On the other hand, since we use a server with 128 GB of RAM (presented in subsection 4.1), space complexity does not constitute a bottleneck in our algorithm. Therefore, our approach is sufficiently efficient and scalable on the studied datasets and their respective models. The experiments reported in subsection 4.6 confirm the scalability and efficiency of our proposed approach.
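To make the orders of magnitude concrete, the cost of the search can be approximated as the number of fitness evaluations multiplied by the cost of one model training. This is only a rough estimate: it assumes that fitness evaluation dominates the run time, ignores the speed-up obtained from multiprocessing, and writes r for the number of records/rows.

```latex
\[
  T_{\text{search}} \approx n \cdot m \cdot T_{\text{fit}},
  \qquad n \cdot m = 30 \times 15 = 450 \ \text{fitness evaluations},
\]
\[
  T_{\text{fit}} = O\!\left(k \cdot v \cdot r \log r\right).
\]
```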
It is worth noting that the second dimension of a particle (number of trees) may take a non-integer value. This is problematic because the number of decision trees making up the ensemble model must be an integer. In that case, we round the value to the closest integer. Furthermore, if at any iteration a particle's new position falls outside the search-space bounds (e.g., [0.1, 0.4] for the first dimension and [50, 1000] for the second one), we correct each out-of-bound value to the closest bound. For instance, if a particle's new position is (0.5, 1200), we correct it to (0.4, 1000).
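The two corrections can be sketched in a few lines of Python. The helper name `repair_position` and the `integer_dims` argument are illustrative assumptions; note also that Python's `round` resolves exact .5 ties to the nearest even integer, which may differ from other rounding conventions.

```python
def repair_position(position, bounds, integer_dims=(1,)):
    """Clamp each coordinate to its bounds and round the integer dimensions."""
    repaired = []
    for d, (value, (lo, hi)) in enumerate(zip(position, bounds)):
        value = min(hi, max(lo, value))   # snap to the closest bound
        if d in integer_dims:
            value = int(round(value))     # the number of trees must be an integer
        repaired.append(value)
    return repaired


bounds = [(0.1, 0.4), (50, 1000)]
print(repair_position([0.5, 1200], bounds))    # [0.4, 1000]
print(repair_position([0.25, 312.6], bounds))  # [0.25, 313]
```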
Optimal solutions: Finally, after the maximum number of iterations is reached, we apply a maximization function, which takes as input all the solutions explored during the iterative search and returns only the ones with the highest fitness value (𝑓1 score). If more than one optimal solution is found (i.e., several solutions give the same 𝑓1 score), we select the one whose set of hyper-parameters induces the best efficiency (e.g., execution time and CPU usage). Then, the appropriate model is trained using the selected optimum's hyper-parameters, and only the features whose importance values are higher than the average importance of all features are selected for the next phase (i.e., anomaly detection).
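A minimal sketch of these two selection steps is given below. All values are made up for illustration, the function names are hypothetical, and the tie-break on the number of trees is only an assumed proxy for efficiency (the text mentions execution time and CPU usage, which are not modelled here).

```python
def pick_optimum(solutions):
    """Return the solution with the highest f1; break ties by fewest trees."""
    best_f1 = max(s["f1"] for s in solutions)
    tied = [s for s in solutions if s["f1"] == best_f1]
    # Assumption: fewer trees is used here as a cheap proxy for efficiency.
    return min(tied, key=lambda s: s["n_trees"])


def select_above_average(importances):
    """Keep the features whose importance exceeds the mean importance."""
    mean_importance = sum(importances.values()) / len(importances)
    return [name for name, value in importances.items() if value > mean_importance]


# Made-up values for illustration only.
solutions = [
    {"test_size": 0.10, "n_trees": 900, "f1": 0.97},
    {"test_size": 0.12, "n_trees": 400, "f1": 0.97},
    {"test_size": 0.30, "n_trees": 100, "f1": 0.95},
]
importances = {"feat_a": 0.50, "feat_b": 0.30, "feat_c": 0.15, "feat_d": 0.05}

print(pick_optimum(solutions)["n_trees"])   # 400
print(select_above_average(importances))    # ['feat_a', 'feat_b']
```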
3.3.2. Deep Learning-Based Anomaly Detection
After selecting the optimal feature selection model and using it to select the most relevant features, we use the dataset filtered down to the selected features to generate an efficient anomaly detection model using autoencoders. To this end, we start from the most accurate of the existing state-of-the-art models for that dataset; using the most efficient proposed model as a starting point reduces our search for the appropriate model. Then, we feed the model with only the selected features of the dataset, which helps reduce the autoencoder model's training time: since we do not use all the features, the compression and bottleneck generation (encoder) as well as the reconstruction phase (decoder) run faster than when all the features are fed as input.
Listing 1: Malicious and Benign Traffic Logs Sources
----------------- Malicious Traffic Logs Sources -----------------
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-370-1/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-371-1/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-372-1/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-373-1/bro/conn.log
----------------- Benign Traffic Logs Sources -----------------
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-25/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-26/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-27/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-28/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-29/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-30/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-31/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-32/bro/conn.log
* https://mcfp.felk.cvut.cz/publicDatasets/CTU-Normal-33/bro/conn.log
There are multiple hyper-parameters that we need to tune in order to find the optimal autoencoder model: batch size, loss function, number of layers, number of neurons, and regularization. Once we reach a point where our model outperforms the state-of-the-art models (e.g., Yang et al. (2019)), we stop the search and select that model as the optimal one. We then use the l1 norm to compute the distance between the input and the reconstructed data. This reconstruction error is then compared against the input labels (ground truth), and different threshold values are tuned to select the one that gives the highest accuracy metrics on each dataset.
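The thresholding step can be sketched as follows, assuming that the autoencoder has already produced a reconstruction for every input record. The data and function names are made up for illustration, and the sketch selects the threshold by 𝑓1 score, which is one possible reading of "highest accuracy metrics".

```python
def l1_distance(x, x_hat):
    """L1 norm of the reconstruction error for one record."""
    return sum(abs(a - b) for a, b in zip(x, x_hat))


def f1_for_threshold(errors, labels, threshold):
    """f1 score obtained when records with error > threshold are flagged as anomalies."""
    predictions = [1 if e > threshold else 0 for e in errors]
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(errors, labels, candidates):
    """Return the candidate threshold with the highest f1 score."""
    return max(candidates, key=lambda t: f1_for_threshold(errors, labels, t))


# Made-up inputs, reconstructions, and ground-truth labels (1 = anomaly).
inputs = [[0.1, 0.2], [0.0, 0.1], [0.9, 0.8], [0.2, 0.1]]
reconstructions = [[0.1, 0.25], [0.05, 0.1], [0.3, 0.2], [0.2, 0.15]]
labels = [0, 0, 1, 0]

errors = [l1_distance(x, x_hat) for x, x_hat in zip(inputs, reconstructions)]
threshold = best_threshold(errors, labels, candidates=[0.01, 0.5, 2.0])
print(threshold)  # 0.5
```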
3.4. IoT-Zeek Dataset Generation
In the following, we describe the methodology used to generate the IoT-Zeek dataset of malicious and benign network traffic. We first deploy a real environment that consists of various Raspberry Pi devices that communicate with each other. We install Zeek sensors to monitor the network traffic and extract the connection logs (conn.log) generated by the Zeek NIDS Paxson (1999); Team (2018). Then, we inject different malware samples into these devices and capture the malicious network traffic. These connection logs are then classified as malicious or benign connections using classical machine learning and deep learning models (as explained later). A portion of the dataset containing 150,000 records (connections), of which 129,441 are malicious connections and 20,559 are benign connections, is sampled.
Malware and Benign Sources: To ensure the freshness of our dataset regarding the malicious/benign IP addresses, we collect PCAPs from both the Concordia SecLab malware feed (in-house source) and the Stratosphere Research Laboratory (Laboratory (2018)). Then, we build the global training dataset from the labeled Zeek logs. The malicious and benign traffic logs are retrieved from the sources presented in Listing 1. The numbers of malicious and benign connections in the evaluation dataset are 1,764,604 and 278,998, respectively.
3.4.1. Maliciousness Classification
In this section, we present the ensemble models employed to classify malicious activities in the IoT-Zeek dataset. As depicted in Figure 3, the first set of models belongs to classical machine learning, while the second one belongs to deep learning. The input to the models is the full set of connection log features presented in Table 1 (there exist other features that Zeek does not extract from PCAP files in the default setting)⁵.
Figure 3: Maliciousness Detection Pipeline.
Table 1
Zeek's Connection Log File Features Description.
Feature         Type         Description
ts              Numerical    Unix timestamp of the connection's occurrence date.
id.orig_h       Categorical  Originator's IP address.
id.orig_p       Categorical  Originator's TCP/UDP port.
id.resp_h       Categorical  Responder's IP address.
id.resp_p       Categorical  Responder's TCP/UDP port.
proto           Categorical  The transport layer protocol (tcp, udp, or icmp).
service         Categorical  The requested application layer protocol (e.g., ssh, dns, etc.).
orig_ip_bytes   Numerical    Number of bytes sent from the originator to the responder; this is extracted from the packet header.
resp_ip_bytes   Numerical    Number of bytes sent from the responder to the originator.
orig_pkts       Numerical    Number of packets sent from the originator to the responder.
resp_pkts       Numerical    Number of packets sent from the responder to the originator.
conn_state      Categorical  A string giving an overview description of the state of the connection.
history         Categorical  A string giving more details about the state of the connection.
As for the classical machine learning classification, we employ RandomForest, XGBoost, LightGBM, and CatBoost classifiers. We choose these classifiers due to their high performance and reputation in the industry. Moreover, the chosen classifiers have been part of many winning solutions in machine learning competitions⁶. In addition to the classical machine learning models, we deploy two deep learning models for maliciousness detection: a convolutional neural network (CNN) and a feed forward neural network (NN). More specifically, we customize the architecture of the CNN model to learn the maliciousness of the network traffic, as shown in Figure 4; the details of the model are presented in Table 2. Parameters such as Filters are obtained from experiments as a trade-off between the size and the accuracy of the model, whereas Kernel and Stride take values that are standard in the CNN literature. The feed forward neural network is a typical neural network with fully connected layers. The details of the model are listed in Table 3.
⁵ https://docs.zeek.org/en/lts/scripts/base/protocols/conn/main.zeek.html
⁶ https://www.kaggle.com/competitions
Table 2
Dimension Convolutional Neural Network Model Details.
Block    #  Layer            Options
Block1   1  Conv             Filter=64, Kernel=(3,1), Stride=(1,1), ZeroPadding, Activation=ReLU
Block1   2  BNorm            BatchNormalization
Block1   3  MaxPooling       Kernel=(2,2), Stride=(2,2), Zero-Padding
Block2   4  MaxPooling       Global Max Pooling
Block2   5  Fully Connected  #Output=512, Activation=ReLU
Block2   6  Fully Connected  #Output=1, Activation=Sigmoid
Table 3
Feed Forward Neural Network Model Details.
#  Layer                Options
1  Fully Connected      #Output=128, Activation=ReLU
2  Batch Normalization  Batch Normalization
3  Fully Connected      #Output=256, Activation=ReLU
4  Batch Normalization  Batch Normalization
5  Fully Connected      #Output=512, Activation=ReLU
6  Batch Normalization  Batch Normalization
7  Fully Connected      #Output=512, Activation=ReLU
8  Fully Connected      #Output=1, Activation=Sigmoid
Figure 4: CNN Maliciousness Detection Model's Architecture.
Ensemble Models: Training the aforementioned machine learning classifiers on the connection log features produces a set of models 𝑀 = {𝑐𝑀1, 𝑐𝑀2, 𝑐𝑀3, 𝑐𝑀4, 𝑑𝑀1, 𝑑𝑀2}, where 𝑐𝑀𝑖 represents a classical machine learning model/learner (i.e., RandomForest, XGBoost, LightGBM, and CatBoost classifiers) and 𝑑𝑀𝑖 represents a deep learning model/learner (i.e., CNN and NN). To perform ensemble learning, we use the ensemble averaging technique presented in Equation 5, as follows:
\[ Y'(X, \alpha) = \sum_{i=1}^{|M|} \alpha_i \times y_i(X) \tag{5} \]
where 𝑌′ is the ensemble probability likelihood, 𝑋 is the input feature vector, 𝛼𝑖 are the weights, and 𝑦𝑖 is the probability likelihood produced by each single model. Each individual model/learner (the classical machine learning and deep learning models presented in Figure 3) produces a single probability (𝑦𝑖), which represents the maliciousness likelihood. These models' detection probabilities are the input to the weighted average ensemble, which uses the 𝛼𝑖 weights to produce the ensemble prediction. In the current setting, we choose 𝛼𝑖 = 1 for all the models, which means that all the models contribute equally to the final decision.
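The weighted averaging step can be illustrated with a short Python sketch. The probabilities below are made up, and the function name is hypothetical. One assumption is added on top of Equation 5: the weighted sum is divided by the sum of the weights, so that the result stays in [0, 1] when all weights are equal to 1.

```python
def ensemble_average(probabilities, weights=None):
    """Weighted average of per-model maliciousness probabilities."""
    if weights is None:
        weights = [1.0] * len(probabilities)   # equal contribution of all models
    if len(weights) != len(probabilities):
        raise ValueError("one weight per model is required")
    weighted_sum = sum(w * p for w, p in zip(weights, probabilities))
    # Assumption: normalize by the total weight so the output is a probability.
    return weighted_sum / sum(weights)


# Made-up outputs of six models (four classical, two deep) for one connection.
model_probabilities = [0.9, 0.8, 0.95, 0.85, 0.7, 0.6]
score = ensemble_average(model_probabilities)
print(round(score, 3))  # 0.8
```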
3.4.2. System Adaptation
Adaptation to new network threats and attacks is an important criterion in malicious network traffic detection. In our context, we provide this capability through the automation of the model generation process. As shown in Figure 5, the system leverages a feed of malicious traffic (in the form of PCAP files) to build an updated training dataset every epoch. The updated training dataset is representative of state-of-the-art network attacks and benign traffic. The system ensures the quality of the produced model by using validation and testing datasets, and only models that achieve high detection performance are deployed in production.
4. Evaluation Results and Discussion
In this section, we first describe the experimental setup,
and the benchmark datasets. Then, we provide more details
on the validation of our proposed feature selection approach
on each of the chosen benchmark datasets. Next, we report
the accuracy of our anomaly detection model on different
datasets and compare our results with the state-of-the-art ap-
proaches. Finally, we provide the results of our efficiency
study.
4.1. Experimental Setup
All our experiments are conducted on a computation server with an Intel Xeon E5-2630 2.30 GHz CPU with 24 cores and 128 GB of RAM, running CentOS Linux version 7. Our system prototype is developed in Python 3.6 and PyTorch, leveraging the scikit-learn (sklearn) library for bagging ensemble learning techniques (random forest classifier), and the xgboost library for boosting ensemble techniques (gradient boosting classifier) and other machine learning models. Multiprocessing is used to speed up model training by taking advantage of 20 cores of the CPU for both the Random Forest and XGBoost classifiers. We use the pandas API to load and preprocess each dataset. We build the autoencoder models using the keras API in conjunction with tensorflow.
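Evaluation Metrics. To evaluate the performance of our approach, we use the accuracy, precision, recall, and 𝐹1 score metrics, which are defined as follows:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \]
\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.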
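These definitions translate directly into code. The following sketch uses made-up confusion-matrix counts for an imbalanced test set (50 positives out of 1,000 records) to show how accuracy can remain high while the 𝐹1 score exposes poor recall; the function name is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: only 10 of the 50 positives are detected, with no false alarms.
metrics = classification_metrics(tp=10, tn=950, fp=0, fn=40)
print(metrics)  # accuracy 0.96, precision 1.0, recall 0.2, f1 about 0.333
```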
4.2. Benchmark Datasets Description
In this subsection, we introduce the two benchmark
datasets as follows.
NSL-KDD Dataset: The NSL-KDD network dataset (Tavallaee et al. (2009)) was proposed to fix two main issues (namely, redundant records and level of difficulty) of its predecessor, the KDD'99 dataset. The updated dataset (NSL-KDD) contains a total of 148,517 network flow records, with 77,054 labeled as normal records and 71,463 labeled as attacks. The dataset consists of a total of 41 features, 32 of which are of numerical (integer or float) type and 9 of which have categorical values.
Figure 5: Machine Learning Models Development.
UNSW-NB15 Dataset: The UNSW-NB15 dataset (Moustafa and Slay (2015); Moustafa and Slay (2016a); Moustafa et al. (2019); Moustafa et al. (2017)) was created by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) using the IXIA PerfectStorm framework to generate traffic that contains real normal and attack behaviours. Tcpdump is then used to capture 100 GB of network traffic activity. The dataset covers nine types of cyber attacks: Fuzzers, Analysis, Backdoors, Denial of Service (DoS), Exploits, Generic, Reconnaissance, Shellcode, and Worms. Moreover, the generated dataset contains a total of 49 features, 42 of which are of numerical (integer or float) type and 6 of which are of categorical type. This dataset contains 2,218,761 normal and 321,283 attack records.
4.3. Feature Selection Results
We apply our proposed optimized feature selection solution to the three aforementioned datasets. The results for the two explored fitness functions, Random Forests (bagging) and XGBoost (boosting), are detailed in Table 4 and Table 5, respectively. We observe that the latter fitness function (XGBoost) achieves the highest fitness values (𝑓1 score) on all three datasets. Moreover, when the 𝑐2 acceleration constant is higher than 𝑐1 (𝑐2 = 2 and 𝑐1 = 1), the algorithm finds better optimal solutions for two of the datasets, while for the NSL-KDD dataset both settings give the same 𝑓1 score values.
Afterwards, for the set of hyper-parameters selected for each dataset, we train the appropriate model (XGBoost) and extract the list of features with their corresponding importance values. Then, we compute the average importance and select only the features whose importance is higher than this average. The results of this process are presented in Table 6.
Effects of Imbalanced Data: We further examine the effects of imbalanced data on our feature selection approach. As presented in subsection 3.4, the IoT-Zeek dataset has fewer benign connections than malicious connections, which may lead machine learning algorithms to ignore the minority class. According to the literature Fernández et al. (2018), oversampling and undersampling techniques are recommended to overcome this issue. To this end, we leverage the SMOTE and RUS Python libraries⁷ and apply both oversampling and undersampling techniques to the IoT-Zeek data. According to the obtained results, the oversampling technique slightly outperforms the undersampling technique. Consequently, we consider the oversampled dataset during our experiments and refer to it as the IoT-Zeek-Oversampled dataset.
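The idea behind oversampling can be illustrated with a small standard-library sketch. It performs plain random oversampling (duplicating minority-class rows until the classes are balanced); this is a simpler stand-in and not SMOTE, which synthesizes new minority samples by interpolating between neighbours. The function name and data are made up for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            new_rows.append(rng.choice(pool))
            new_labels.append(cls)
    return new_rows, new_labels


# Made-up toy data: 8 malicious rows and 2 benign rows.
rows = [[i] for i in range(10)]
labels = ["malicious"] * 8 + ["benign"] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))  # both classes now have 8 rows
```

Random undersampling works in the opposite direction by discarding majority-class rows, which is cheaper but throws away data.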
We apply our optimized feature selection solution to the IoT-Zeek-Oversampled dataset. The results of the two explored fitness functions are presented in Table 7. We observe that the XGBoost fitness function achieves the highest fitness values (𝑓1 score). More specifically, when the 𝑐2 acceleration constant is equal to 𝑐1 (𝑐2 = 2 and 𝑐1 = 2), the XGBoost algorithm finds better optimal solutions. Afterwards, for each selected set of hyper-parameters, we train the appropriate XGBoost model and extract the list of features with their corresponding importance values. Then, we select only the features whose importance is higher than the average importance, as presented in Table 8.
⁷ https://imbalanced-learn.org/stable/index.html
Table 4
Feature Selection Results using Random Forests Classifier as Fitness Function (Acceleration constant c2 is fixed to 2 whilst c1 is tuned, 𝑓1 score is the fitness function).
Dataset           c1  Test size  #Trees  Accuracy  𝑓1 score  Precision  Recall
NSL-KDD           2   0.1        71      99.52     99.52     99.52      99.52
NSL-KDD           2   0.1        70      99.52     99.52     99.52      99.52
NSL-KDD           2   0.1        323     99.51     99.51     99.51      99.5
NSL-KDD           2   0.103      295     99.51     99.51     99.51      99.51
NSL-KDD           2   0.12       50      99.5      99.51     99.51      99.5
NSL-KDD           2   0.1        107     99.52     99.51     99.52      99.51
NSL-KDD           2   0.15       258     99.51     99.51     99.51      99.5
NSL-KDD           2   0.2        50      99.52     99.51     99.52      99.51
NSL-KDD           2   0.1        424     99.5      99.5      99.5       99.5
NSL-KDD           1   0.1        50      99.549    99.549    99.549     99.54
NSL-KDD           1   0.1        63      99.51     99.51     99.51      99.51
NSL-KDD           1   0.1        51      99.51     99.51     99.51      99.51
NSL-KDD           1   0.1002     153     99.51     99.51     99.51      99.51
NSL-KDD           1   0.105      50      99.51     99.51     99.51      99.51
UNSW-NB15         2   0.1        50      99.93     99.49     99.28      99.69
UNSW-NB15         2   0.1        84      99.93     99.43     99.15      99.71
UNSW-NB15         2   0.103      50      99.93     99.42     99.17      99.68
UNSW-NB15         1   0.1        1000    99.92     99.4      99.1       99.70
UNSW-NB15         1   0.1005     1000    99.92     99.4      99.1       99.70
UNSW-NB15         1   0.106      1000    99.92     99.4      99.09      99.70
IoT-Zeek Dataset  2   0.1        724     99.99     99.99     100        99.98
IoT-Zeek Dataset  2   0.111      106     99.99     99.99     100        99.98
IoT-Zeek Dataset  2   0.112      980     99.99     99.99     100        99.98
IoT-Zeek Dataset  1   0.214      80      99.99     99.99     100        99.98
IoT-Zeek Dataset  1   0.103      111     99.99     99.99     100        99.98
IoT-Zeek Dataset  1   0.295      50      99.99     99.99     100        99.98
Table 5
Feature Selection Results using XGBoost Classifier as Fitness Function (Acceleration constant c2 is fixed to 2 whilst c1 is tuned, 𝑓1 score is the fitness function).
Dataset           c1  Test size  #Trees  Learning rate  Accuracy  𝑓1 score  Precision  Recall
NSL-KDD           2   0.102      376     0.162523       99.75     99.75     99.75      99.75
NSL-KDD           2   0.13       327     0.138372       99.75     99.75     99.75      99.73
NSL-KDD           2   0.104      292     0.16305        99.75     99.75     99.75      99.75
NSL-KDD           2   0.1        233     0.1473         99.739    99.739    99.739     99.73
NSL-KDD           2   0.10558    241     0.17077        99.73     99.73     99.7       99.73
NSL-KDD           1   0.106      680     0.1            99.75     99.75     99.75      99.75
NSL-KDD           1   0.105      681     0.1            99.75     99.75     99.75      99.73
NSL-KDD           1   0.106      686     0.1            99.75     99.75     99.75      99.75
UNSW-NB15         2   0.1        903     0.1            99.97     99.76     99.71      99.82
UNSW-NB15         2   0.1        824     0.1            99.97     99.76     99.70      99.82
UNSW-NB15         2   0.1        827     0.1            99.97     99.76     99.71      99.82
UNSW-NB15         2   0.16       903     0.102          99.90     99.60     99.70      99.8
UNSW-NB15         2   0.1        899     0.10013        99.90     99.60     99.70      99.8
UNSW-NB15         1   0.1        1000    0.1            99.90     99.80     99.80      99.87
UNSW-NB15         1   0.1        1000    0.137          99.96     99.70     99.65      99.77
UNSW-NB15         1   0.158      1000    0.141          99.96     99.71     99.68      99.74
UNSW-NB15         1   0.17       816     0.12           99.96     99.69     99.67      99.71
UNSW-NB15         1   0.114      425     0.159          99.96     99.67     99.57      99.77
UNSW-NB15         1   0.137      481     0.125          99.96     99.67     99.61      99.73
IoT-Zeek Dataset  2   0.4        1000    0.294          99.90     99.90     99.90      99.90
IoT-Zeek Dataset  2   0.4        1000    0.158          99.90     99.90     100        99.90
IoT-Zeek Dataset  2   0.325      730     0.3            99.90     99.90     100        99.90
IoT-Zeek Dataset  2   0.1        411     0.1            99.90     99.90     100        99.98
IoT-Zeek Dataset  1   0.369      510     0.3            99.99     99.99     100        99.99
IoT-Zeek Dataset  1   0.360      624     0.3            99.99     99.99     100        99.98
IoT-Zeek Dataset  1   0.397      382     0.132          99.99     99.99     99.99      99.99
IoT-Zeek Dataset  1   0.361      664     0.3            99.99     99.99     100        99.97
4.4. Anomaly Detection Results
In this section, we describe the architecture of our autoencoder models for each of the utilized datasets, which we call the NSL-KDD Model, UNSW-NB15 Model, IoT-Zeek Model, and IoT-Zeek-Oversampled Model. Then, we present the results of these models and compare the results obtained for each dataset's model with the state-of-the-art approaches presented in Yang et al. (2019).
NSL-KDD Model: After several iterations of model train-
ing, we found that the optimal anomaly detection model for
this dataset has five hidden layers: two for the encoder (128
and 64 neurons respectively), one layer for the bottleneck
(32 neurons), and two others for the decoder (64 and 128
Aniss Chohra et al.: Preprint submitted to Elsevier Page 12 of 19
Optimized Feature Selection for Network Anomaly Detection
Table 6
Selected Features on each Dataset using the Optimal Solution Hyper-parameters.
(Acceleration constant 𝑐2 is fixed to 2 and 𝑐1 is tuned between 1 and 2).
Dataset & Model          Feature name                   Feature importance
NSL-KDD                  src_bytes                      0.298222400
Test size: 0.106         num_failed_logins              0.131071240
Number of trees: 680     service                        0.074615410
Learning rate: 0.1       diff_srv_rate                  0.054890107
                         flag                           0.039971426
                         hot                            0.037307087
                         count                          0.027289085
                         dst_host_srv_diff_host_rate    0.025930267
                         dst_host_same_srv_rate         0.024637770
UNSW-NB15                sttl                           0.087299424
Test size: 0.1           ct_state_ttl                   0.059610307
Number of trees: 1000    dsport                         0.018620330
Learning rate: 0.1       proto                          0.007498254
IoT-Zeek Dataset         ts                             0.672937750
Test size: 0.369         id_orig_p                      0.121201570
Number of trees: 510     history                        0.110579970
Learning rate: 0.3       resp_ip_bytes                  0.089164086
Table 7
Feature Selection Results on IoT-Zeek-Oversampled Dataset.
Fitness Function (𝑓1 score)  c1  Test size  #Trees  Accuracy  𝑓1 score  Precision  Recall
Random Forest (C2=2)         2   0.4        478     99.99     99.99     99.99      99.99
                             2   0.4        518     99.99     99.99     99.99      99.99
                             2   0.4        534     99.99     99.99     99.99      99.99
                             1   0.4        50      99.99     99.99     99.99      99.99
                             1   0.4        116     99.99     99.99     100        99.99
                             1   0.4        431     99.99     99.99     99.99      99.99
(C2=2)                       2   0.4        50      100       100       100        100
                             2   0.3054     50      100       100       100        100
                             2   0.4        50      100       100       100        100
                             1   0.4        492     99.99     99.99     99.99      99.99
                             1   0.3963     679     99.99     99.99     99.99      99.99
                             1   0.4        739     99.99     99.99     99.99      99.99
neurons, respectively). In addition, we used two regularization
techniques to mitigate overfitting, namely, dropout with a rate of 0.5
and an L2 kernel regularizer with a value of 0.001 at each layer (as
shown in the first part of Table 9). Moreover, this autoencoder is
fully connected, such that all layers are of the Dense layer type, and
each of these layers uses ReLU as its activation function. The optimal
batch size is 32, with the test set size equal to the optimal one found
by the optimization algorithm (0.106). Additionally, we tuned the model
with three different loss functions: categorical crossentropy, mean
squared error, and mean absolute error. The results of validating this
model with different thresholds are presented in Table 10. As can be
seen, the model performs best with categorical crossentropy as the loss
function, with an optimal threshold of 0.512, achieving an average
𝑓1 score of approximately 92.09%.
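To illustrate how a threshold turns autoencoder output into detections, the sketch below labels a sample as anomalous when its reconstruction error exceeds the threshold and then computes precision, recall, and 𝑓1 score. This is a hedged illustration only: the direction of the comparison (error above threshold means anomaly) is an assumption, and the error values, labels, and function names are hypothetical; only the threshold value 0.512 comes from the text.

```python
def flag_anomalies(reconstruction_errors, threshold):
    """Label a sample 1 (anomalous) when its reconstruction error exceeds the threshold."""
    return [1 if error > threshold else 0 for error in reconstruction_errors]


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and f1 for the positive (anomalous) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical per-sample reconstruction errors and ground-truth labels.
errors = [0.10, 0.95, 0.30, 0.70, 0.45, 0.60]
labels = [0, 1, 0, 1, 1, 0]

predictions = flag_anomalies(errors, threshold=0.512)
precision, recall, f1 = precision_recall_f1(labels, predictions)
print(predictions, precision, recall, f1)
```

Repeating this computation for several candidate thresholds yields rows of the kind reported in Table 10.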
UNSW-NB15 Model: After applying the same model tuning steps used for
the NSL-KDD dataset, we found that the optimal model for this dataset
has exactly the same regularization values at each layer (dropout=0.5
and kernel_regularizer_l2=0.001). However, there are two differences
compared to the previous model on the NSL-KDD dataset. First, with this
dataset there are exactly seven hidden layers
(encoder=[512,256,128], bottleneck=[64], and
Table 8
Selected Features on IoT-Zeek-Oversampled Dataset using the Optimal Solution
Hyper-parameters.
Model                    Feature name     Feature importance
                         ts               0.254218453
Test size: 0.3905        resp_ip_bytes    0.161528458
Number of trees: 50      resp_pkts        0.152892584
Learning rate: 0.2648    resp_bytes       0.084122151
                         id_orig_p        0.083250296
Figure 6: Deep Learning Anomaly Detection: Train and Validation Loss.
Figure 7: Autoencoder Anomaly Detection ROC Curves.
decoder=[128,256,512]), as shown in the second part of Table 9, with
the batch size set to 64. Second, this model, contrary to the previous
one, performs better with the mean squared error loss function,
achieving an overall average 𝑓1 score of 92.904% at an optimal
threshold of 2.239 (as shown in Table 10).
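The choice of the optimal threshold can be read as a simple sweep over candidate thresholds, keeping the one with the highest 𝑓1 score. The sketch below applies this to the (threshold, 𝑓1 score) pairs listed for UNSW-NB15 in Table 10; the helper name `best_threshold` is hypothetical, and how the candidate thresholds were generated is not stated in the text.

```python
# (threshold, f1 score) pairs copied from the UNSW-NB15 rows of Table 10.
unsw_nb15_rows = [
    (1.352, 87.731),
    (1.342, 87.784),
    (2.365, 88.024),
    (2.346, 88.387),
    (2.325, 88.758),
    (2.305, 89.129),
    (2.286, 89.511),
    (2.265, 89.836),
    (2.162, 89.927),
    (2.239, 92.904),
]


def best_threshold(rows):
    """Return the (threshold, f1) pair with the highest f1 score."""
    return max(rows, key=lambda row: row[1])


threshold, f1 = best_threshold(unsw_nb15_rows)
print(threshold, f1)
```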
IoT-Zeek Model: We achieve an overall average 𝑓1 score of 97.302% on
this dataset (as shown in Table 10) using the mean squared error loss
function. The autoencoder model for this dataset is the same as the one
for NSL-KDD, except that it uses a smaller value for the L2 kernel
regularizer (0.0001), as shown in the third part of Table 9.
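The architectures summarized in Table 9 are symmetric: the decoder mirrors the encoder around the bottleneck. The sketch below records each model as a small configuration object and derives the full list of hidden layer widths. The class and field names are hypothetical, and this is only a bookkeeping aid; it does not build or train a network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoencoderSpec:
    encoder: tuple      # encoder layer widths, outermost first
    bottleneck: int     # width of the bottleneck layer
    dropout: float      # dropout rate
    l2: float           # L2 kernel regularizer value

    def hidden_layers(self):
        """Encoder widths, the bottleneck, then the mirrored decoder widths."""
        return list(self.encoder) + [self.bottleneck] + list(reversed(self.encoder))


# Values as listed in Table 9.
SPECS = {
    "NSL-KDD": AutoencoderSpec((128, 64), 32, 0.5, 0.001),
    "UNSW-NB15": AutoencoderSpec((512, 256, 128), 64, 0.5, 0.001),
    "IoT-Zeek": AutoencoderSpec((128, 64), 32, 0.5, 0.0001),
    "IoT-Zeek-Oversampled": AutoencoderSpec((128, 64), 32, 0.5, 0.001),
}

for name, spec in SPECS.items():
    print(name, spec.hidden_layers())
```

In a deep learning framework, each width in the derived list would correspond to one fully connected layer with ReLU activation, as described above.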
IoT-Zeek-Oversampled Model: We apply the same hyper-parameters that
were tuned for the NSL-KDD Model, as shown in Table 9, and train our
autoencoder model on the IoT-Zeek-Oversampled dataset using the mean
squared error loss function. The ROC curve is shown in Figure 8, where
the model achieves a 99% area under the curve (AUC). Moreover, the
results obtained for different threshold values are presented in
Table 10. As can be seen, we achieve an 𝑓1 score of 94.300% on the
oversampled data, while the 𝑓1 score obtained on the original data was
97.302%. This drop of roughly 3 percentage points in the 𝑓1 score can
be explained by the selected features and their importance before and
after oversampling, as presented in Table 6 and Table 8, respectively.
Since the feature selection technique applies statistical methods to
find the best features, if the population (the dataset) changes (due to
over- or under-sampling), the importance of the selected features will
most likely be different, which affects the overall performance.
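The change in selected features can be made concrete by comparing the IoT-Zeek rows of Table 6 with Table 8. The sketch below uses the importance values from those two tables; the variable names are illustrative.

```python
# Selected features and importances for IoT-Zeek, from Table 6 (original data).
original = {
    "ts": 0.672937750,
    "id_orig_p": 0.121201570,
    "history": 0.110579970,
    "resp_ip_bytes": 0.089164086,
}

# Selected features and importances from Table 8 (oversampled data).
oversampled = {
    "ts": 0.254218453,
    "resp_ip_bytes": 0.161528458,
    "resp_pkts": 0.152892584,
    "resp_bytes": 0.084122151,
    "id_orig_p": 0.083250296,
}

kept = sorted(original.keys() & oversampled.keys())
dropped = sorted(original.keys() - oversampled.keys())
added = sorted(oversampled.keys() - original.keys())

# Change in importance for the features selected in both cases.
importance_shift = {name: oversampled[name] - original[name] for name in kept}

print("kept:", kept)
print("dropped:", dropped)
print("added:", added)
print("shift:", importance_shift)
```

The comparison shows that `history` is no longer selected after oversampling, `resp_pkts` and `resp_bytes` are newly selected, and the importance of `ts` falls from about 0.67 to about 0.25.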
Figure 8: Autoencoder Anomaly Detection ROC Curve on IoT-Zeek-Oversampled Dataset.
We further measure the training and validation loss for each of the
aforementioned models, as presented in Figure 6, where we can see that
for all three models there is no overfitting and each model's loss
becomes stable around epoch 6. Moreover, Figure 7 shows the Receiver
Operating Characteristic (ROC) curve for each of the aforementioned
models (NSL-KDD Model, UNSW-NB15 Model, and IoT-Zeek Model), where all
three models achieve more than 90% area under the curve (AUC), with the
IoT-Zeek Model achieving almost 100% AUC.
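For readers less familiar with the AUC metric, the sketch below computes it directly from anomaly scores using its pairwise interpretation: the AUC equals the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one (ties count as one half). The labels and scores are hypothetical, and the function name `roc_auc` is illustrative; this is not the evaluation code used for Figure 7.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical anomaly scores (e.g., reconstruction errors) and true labels.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(auc)
```

The quadratic pairwise loop is fine for a small illustration; a rank-based computation would be preferable on datasets of the size discussed here.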
4.5. Comparative Study
We further compare our two models, trained on the NSL-KDD and
UNSW-NB15 benchmark datasets, with the most prominent recent
state-of-the-art anomaly detection models (e.g., the ones presented in
Yang et al. (2019)) applied on these datasets. According to our
experiments, our proposed autoencoders outperform them in terms of the
𝑓1 score metric. The results of this comparison are depicted in
Figure 9a and Figure 9b, for NSL-KDD and UNSW-NB15, respectively, where
for both datasets, our proposed models achieve the highest 𝑓1 score
values. Moreover, we compare the performance of our proposed approach
with the aforementioned selected works in terms of the accuracy metric on
Table 9
Proposed Autoencoder Architecture by Dataset.
Dataset               Encoder                Bottleneck           Decoder                Regularizations
NSL-KDD               layer 1: 128 neurons   layer 3: 32 neurons  layer 4: 64 neurons    Dropout: 0.5
                      layer 2: 64 neurons                         layer 5: 128 neurons   L2-regularizer: 0.001
UNSW-NB15             layer 1: 512 neurons   layer 4: 64 neurons  layer 5: 128 neurons   Dropout: 0.5
                      layer 2: 256 neurons                        layer 6: 256 neurons   L2-regularizer: 0.001
                      layer 3: 128 neurons                        layer 7: 512 neurons
IoT-Zeek              layer 1: 128 neurons   layer 3: 32 neurons  layer 4: 64 neurons    Dropout: 0.5
                      layer 2: 64 neurons                         layer 5: 128 neurons   L2-regularizer: 0.0001
IoT-Zeek-Oversampled  layer 1: 128 neurons   layer 3: 32 neurons  layer 4: 64 neurons    Dropout: 0.5
                      layer 2: 64 neurons                         layer 5: 128 neurons   L2-regularizer: 0.001
Table 10
Chameleon Deep Learning Anomaly Detection Results.
Dataset & Model                           Threshold  Accuracy  Precision  Recall  𝑓1 score
Dataset: NSL-KDD                          0.105      86.191    81.481     98.021  88.989
Loss function: categorical crossentropy   0.087      86.072    80.618     99.439  89.045
Training Time: 6 mins, 13 sec             0.177      87.833    84.065     97.016  90.077
                                          2.190      89.092    90.753     90.010  90.380
                                          1.758      89.532    90.640     91.008  90.824
                                          1.018      89.607    89.965     92.005  90.974
                                          0.837      90.073    89.900     93.010  91.429
                                          0.314      90.006    87.624     96.002  91.622
                                          0.742      90.592    89.922     94.008  91.920
                                          0.512      90.711    89.351     95.005  92.092
Dataset: UNSW-NB15                        1.352      84.382    79.346     98.099  87.731
Loss function: mean squared error         1.342      84.302    78.795     99.088  87.784
Training Time: 28 mins, 38 sec            2.365      86.058    86.124     90.010  88.024
                                          2.346      86.387    85.913     91.008  88.387
                                          2.325      86.728    85.705     92.036  88.758
                                          2.305      87.083    85.546     93.026  89.129
                                          2.286      87.456    85.406     94.031  89.511
                                          2.265      87.757    85.169     95.044  89.836
                                          2.162      87.629    83.805     97.016  89.927
                                          2.239      89.523    90.00      96.002  92.904
Dataset: IoT-Zeek                         3.101      98.158    96.344     90.004  93.066
Loss function: mean squared error         3.098      98.288    96.329     91.002  93.590
Training Time: 8 mins, 32 sec             3.097      98.407    96.235     92.001  94.070
                                          3.094      98.530    96.157     93.012  94.558
                                          3.092      98.651    96.080     94.010  95.034
                                          3.090      98.777    96.043     95.009  95.523
                                          3.088      98.894    95.944     96.007  95.975
                                          3.086      99.019    95.897     97.005  96.448
                                          3.084      99.134    95.789     98.003  96.884
                                          3.082      99.246    95.659     99.002  97.302
Dataset: IoT-Zeek-Oversampled             95.921     97.321    90.444     90.004  90.223
Loss function: mean squared error         90.063     97.458    90.538     91.002  90.770
                                          89.015     97.595    90.631     92.001  91.311
                                          87.177     97.734    90.724     93.012  91.854
                                          85.143     97.871    90.813     94.010  92.384
                                          3.364      97.817    86.922     99.002  92.569
                                          82.043     97.979    90.719     95.009  92.814
                                          78.225     98.099    90.694     96.00   93.275
                                          78.037     98.236    90.781     97.005  93.790
                                          76.116     98.373    90.866     98.003  94.300
both datasets. The results of this comparative study are reported in
Table 11 for both the NSL-KDD and UNSW-NB15 datasets.
More specifically, we compare the accuracy results obtained during the
testing (using a hold-out dataset) of our autoencoder models (NSL-KDD
Model and UNSW-NB15 Model) trained on both benchmark datasets with the
reported accuracy results of existing state-of-the-art models tested on
the NSL-KDD dataset (e.g., Yang et al. (2019); Ma et al. (2016);
Javaid et al. (2016); Tang et al. (2016); Imamverdiyev and
(a) NSL-KDD Dataset. (b) UNSW-NB15 Dataset.
Figure 9: A Comparative Study of Anomaly Detection Approaches in terms of 𝑓1 score.
Table 11: A Comparative Study of Anomaly Detection Approaches in terms of Accuracy.
Approach                                        NSL-KDD Dataset  UNSW-NB15 Dataset  Average Accuracy
Chameleon                                       90.71%           89.52%             90.115%
Rashid et al. (2022)                            99.90%           94.00%             96.95%
MemAE Min et al. (2021)                         89.51%           85.30%             87.405%
Roy et al. (2022)                               98.50%           Not Reported       -
CNN Mahalakshmi et al. (2021)                   Not Reported     93.50%             -
J48 Roy and Singh (2021)                        Not Reported     87.65%             -
ICVAE-DNN Yang et al. (2019)                    85.97%           89.08%             87.525%
GB-RBM Imamverdiyev and Abdullayeva (2018)      73.23%           Not Reported       -
RNN-IDS Yin et al. (2017)                       81.29%           Not Reported       -
ID-CVAE Lopez-Martin et al. (2017)              80.10%           Not Reported       -
CASCADE-ANN Baig et al. (2017)                  Not Reported     86.40%             -
DNN Tang et al. (2016)                          75.75%           Not Reported       -
STL Javaid et al. (2016)                        74.38%           Not Reported       -
SCDNN Ma et al. (2016)                          72.64%           Not Reported       -
DT Moustafa and Slay (2016b)                    Not Reported     85.56%             -
EM Clustering Moustafa and Slay (2016b)         Not Reported     78.47%             -
Abdullayeva (2018); Yin et al. (2017); Lopez-Martin et al. (2017);
Min et al. (2021)) and on the UNSW-NB15 dataset (e.g., Yang et al.
(2019); Baig et al. (2017); Moustafa and Slay (2016b); Roy and Singh
(2021); Mahalakshmi et al. (2021); Min et al. (2021)), as presented in
Table 11.
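The "Average Accuracy" column of Table 11 is the arithmetic mean of the two per-dataset accuracies and is only defined for approaches that report both. The sketch below recomputes it from the table's values; the variable names are illustrative.

```python
# Accuracy (%) on (NSL-KDD, UNSW-NB15) for the approaches in Table 11 that report both.
reported = {
    "Chameleon": (90.71, 89.52),
    "Rashid et al. (2022)": (99.90, 94.00),
    "MemAE Min et al. (2021)": (89.51, 85.30),
    "ICVAE-DNN Yang et al. (2019)": (85.97, 89.08),
}

# Arithmetic mean of the two accuracies, rounded to three decimals as in the table.
averages = {name: round((nsl + unsw) / 2, 3) for name, (nsl, unsw) in reported.items()}

# Rank the approaches from highest to lowest average accuracy.
ranking = sorted(averages.items(), key=lambda item: item[1], reverse=True)

for name, average in ranking:
    print(f"{name}: {average:.3f}%")
```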
In Roy and Singh (2021), the authors examine different anomaly
detection classifiers on the UNSW-NB15 dataset before and after applying
feature selection. The reported results show that the J48 classifier
achieves the highest accuracy of 87.65%, slightly outperforming the case
where no feature selection is applied (with an accuracy of 87.44%).
However, the authors have not measured other performance evaluation
metrics (e.g., 𝑓1 score, recall, and precision). The authors of
Mahalakshmi et al. (2021) apply a CNN model to the UNSW-NB15 dataset and
detect anomalies with an accuracy of 93.5%. However, the authors have
not examined their approach on the NSL-KDD dataset. Moreover, they have
not reported additional performance metrics (e.g., 𝑓1 score, recall,
and precision) in their evaluation.
In Min et al. (2021), the authors introduce MemAE, which uses
autoencoders to reconstruct the behavior of abnormal samples that look
close to normal ones. They achieve an accuracy of 89.51% and an
𝑓1 score of 89.93% on the NSL-KDD dataset, as well as 85.3% accuracy
and an 85.26% 𝑓1 score on the UNSW-NB15 dataset. However, the authors
have not considered applying feature selection prior to their
autoencoder anomaly detection model to show the difference between the
two scenarios.
Roy et al. (2022) propose B-Stacking, a lightweight supervised
intrusion detection approach based on a machine learning ensemble that
uses K-Nearest Neighbors (KNN), Random Forest, and XGBoost to detect
network anomalies. The approach has been evaluated only on the NSL-KDD
dataset, with an accuracy of 98.50% and an 𝑓1 score of 99.00%, and has
not been tested on the UNSW-NB15 dataset. The authors in Rashid et al.
(2022) propose a stacking ensemble technique (SET) with the SelectKBest
feature selection technique and an ensemble of Decision Trees, Random
Forest, and XGBoost machine learning models for network anomaly
detection. The experiments performed demonstrate that SET obtains 94.00%
for both accuracy and 𝑓1 score on the UNSW-NB15 dataset, and 99.90%
for both accuracy and 𝑓1 score on the NSL-KDD dataset. However, the
SelectKBest technique is less adaptive to new malicious network traffic
over time. In contrast, our proposed solution, CHAMELEON, employs an
autoencoder model, which is more resilient to new threats since it uses
unsupervised techniques. Moreover, CHAMELEON has two sub-detection
modules: supervised (XGBoost and Random Forest for classification) and
unsupervised (deep autoencoders for anomaly detection), both of which
achieve promising accuracy results. Although our main objective is to
perform anomaly detection, the classification results presented in
Table 5 and Table 7 demonstrate high 𝑓1 scores, which outperform
the reviewed existing approaches. The results reported in Table 10,
obtained from our autoencoder-based anomaly detection approach, are
considered high in the context of (unsupervised) anomaly detection.
Amongst the aforementioned works, Yang et al. (2019) deploy a
combination of variational autoencoders and deep neural networks (DNN)
to detect anomalies, which achieves the highest accuracy and 𝑓1 score
of 85.97% and 86.27% on NSL-KDD, and 89.08% and 90.61% on UNSW-NB15.
Ma et al. (2016) combine spectral clustering and DNN, achieving 72.64%
accuracy on NSL-KDD, and Javaid et al. (2016) deploy self-taught
learning, reporting an accuracy of 74.38% on NSL-KDD. Tang et al.
(2016) employ DNN and obtain 75.75% accuracy on NSL-KDD. Imamverdiyev
and Abdullayeva (2018) deploy a Gaussian-Bernoulli Restricted Boltzmann
Machine achieving 73.23% accuracy on NSL-KDD, while Yin et al. (2017)
propose a novel IDS using recurrent neural networks (RNNs), reporting
81.29% accuracy on NSL-KDD. On the other hand, Lopez-Martin et al.
(2017) propose an intrusion detection system based on conditional
variational autoencoders (CVAE), achieving 80.1% accuracy on the
NSL-KDD dataset. Baig et al. (2017) introduce a novel approach for
intrusion detection using multi-cascading artificial neural networks,
achieving an accuracy of 86.4% on the UNSW-NB15 dataset. Moustafa and
Slay (2016b) deploy two approaches: the first uses the
expectation-maximization clustering technique to detect anomalies,
achieving 78.47% accuracy on the UNSW-NB15 dataset, and the second
deploys decision trees on the same dataset and records an accuracy of
85.56%. Given all this information, we notice that our work outperforms
these state-of-the-art works by achieving 90.711% accuracy and a
92.092% 𝑓1 score on the NSL-KDD dataset, and 89.523% accuracy and a
92.904% 𝑓1 score on the UNSW-NB15 dataset.
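As a consistency check on the figures quoted above, the 𝑓1 score is the harmonic mean of precision and recall, so it can be recomputed from the precision and recall values that Table 10 lists at each model's best threshold. The function name below is illustrative; small differences in the last digit are expected because the table values are rounded.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) at the best threshold of each model, from Table 10.
best_rows = {
    "NSL-KDD": (89.351, 95.005, 92.092),
    "UNSW-NB15": (90.00, 96.002, 92.904),
}

computed = {
    name: f1_from_precision_recall(precision, recall)
    for name, (precision, recall, _) in best_rows.items()
}

for name, (_, _, reported_f1) in best_rows.items():
    print(f"{name}: computed {computed[name]:.3f}, reported {reported_f1:.3f}")
```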
The advantages of our approach over the aforementioned existing works
are as follows:
•Feature selection: our work is amongst the few proposed approaches
(e.g., Roy and Singh (2021)) that select the most important features,
here through the PSO algorithm, before applying a detection model. This
leads to a more accurate model, since feature selection helps filter
out unimportant or irrelevant features (noisy data) from the dataset,
so that each class/label is classified more accurately. In addition,
feature selection provides better efficiency and scalability compared
to existing models that use all the features of the datasets.
•Evaluation on a recent real-world IoT dataset: while existing works
evaluated their approaches on the most common benchmark datasets
(NSL-KDD and UNSW-NB15), none of them conducted experiments on a
real-world IoT dataset. In contrast, we first generate our own
real-world IoT dataset, and then apply our models to it in addition to
the non-IoT datasets. This makes our approach more realistic and
applicable to recent security problems.
4.6. Efficiency
In this section, we examine the execution time of the optimized
feature selection algorithm depending on the chosen fitness function.
The obtained results are presented in Figure 10. The execution time is
relatively high for the UNSW-NB15 dataset, due to the huge number of
records as well as the large number of features. However, we do not
consider this an issue, since the optimized feature selection task is
executed only once on each dataset.
Figure 10: Optimized feature selection execution times.
5. Concluding Remarks and Limitations
Optimization of non-deterministic tasks in machine learning and deep
learning is becoming a widespread approach to help developers find
optimal hyper-parameter settings and use them to build their
classification, regression, or clustering models. This paper presented
a novel approach that focuses on finding the optimal hyper-parameters
for ensemble methods in order to select the important features of a
given networking dataset. The proposed approach is developed by
combining ensemble methods with a swarm intelligence optimization
algorithm (PSO). Our validation results indicate that the proposed
algorithm finds better solutions when tuned with boosting (XGBoost)
ensemble techniques than with bagging (Random Forest) ones.
Moreover, we used the optimal solutions found by the optimization
algorithm to select the appropriate set of features on each validation
dataset. Using only those features, we built and tuned an anomaly
detection autoencoder for each of these datasets. The evaluation
results obtained demonstrate that our anomaly detection models
outperform the most efficient state-of-the-art techniques applied on
these datasets. Additionally, they achieve reasonable and reduced
training times.
However, there are some limitations to our work that need to be
addressed in the future. The first is that we optimized only two
hyper-parameters when using Random Forest (number of trees and test
size), and three when using XGBoost
(number of trees, test size, and learning rate). We are currently
exploring the possibility of optimizing more hyper-parameters. In
addition, we need to improve the scalability (execution times) of the
feature selection (optimization) algorithm, although this does not pose
a major issue, since the algorithm needs to be run only once for each
dataset and not on a regular basis. Furthermore, we have not explored
setting the PSO hyper-parameters (𝑐1, 𝑐2, and 𝑤) in an adaptive
fashion, which could also improve the search efficiency; this involves
the use of PSO variants, such as Adaptive Particle Swarm Optimization
(APSO) Zhan et al. (2009), in order to find the optimal settings for
these three hyper-parameters.
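To make the role of 𝑐1, 𝑐2, and 𝑤 concrete, the sketch below implements the standard PSO velocity and position update with a fixed inertia weight. It is a minimal illustration under stated assumptions, not the authors' implementation: the objective is a hypothetical stand-in (a simple quadratic in place of a fitness based on a trained ensemble), the search bounds and the optimum at a test size of 0.3 and 500 trees are invented for the example, and the values chosen for 𝑤 and 𝑐1 are arbitrary picks (with 𝑐2 = 2 as in the table captions).

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iterations=200,
                 w=0.5, c1=1.5, c2=2.0, seed=0):
    """Minimize `objective` over box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_best_value = [objective(p) for p in positions]
    best_index = min(range(n_particles), key=lambda i: personal_best_value[i])
    global_best = personal_best[best_index][:]
    global_best_value = personal_best_value[best_index]

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + cognitive pull (c1) + social pull (c2).
                velocities[i][d] = (
                    w * velocities[i][d]
                    + c1 * r1 * (personal_best[i][d] - positions[i][d])
                    + c2 * r2 * (global_best[d] - positions[i][d])
                )
                lo, hi = bounds[d]
                # Move the particle and clamp it to the search bounds.
                positions[i][d] = min(max(positions[i][d] + velocities[i][d], lo), hi)
            value = objective(positions[i])
            if value < personal_best_value[i]:
                personal_best[i] = positions[i][:]
                personal_best_value[i] = value
                if value < global_best_value:
                    global_best = positions[i][:]
                    global_best_value = value
    return global_best, global_best_value


def toy_objective(params):
    """Hypothetical stand-in for (1 - f1 score); minimum at test_size=0.3, n_trees=500."""
    test_size, n_trees = params
    return (test_size - 0.3) ** 2 + ((n_trees - 500.0) / 1000.0) ** 2


# Hypothetical search ranges for (test size, number of trees).
best_position, best_value = pso_minimize(toy_objective, bounds=[(0.1, 0.4), (50.0, 1000.0)])
print(best_position, best_value)
```

An adaptive variant such as APSO would adjust 𝑤, 𝑐1, and 𝑐2 during the search instead of keeping them fixed as done here.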
References
Ahmad, A., Khan, M., Paul, A., Din, S., Rathore, M.M., Jeon, G., Choi,
G.S., 2018. Toward modeling and optimization of features selection in
big data based social internet of things. Future Generation Computer
Systems 82, 715–726.
Ahmed, M., Mahmood, A.N., Hu, J., 2016. A survey of network anomaly
detection techniques. Journal of Network and Computer Applications
60, 19–31.
Ali, W., Malebary, S.J., 2020. Particle swarm optimization-based feature
weighting for improving intelligent phishing website detection. IEEE
Access 8, 116766–116780.
Alsaheel, A., Nan, Y., Ma, S., Yu, L., Walkup, G., Celik, Z.B., Zhang,
X., Xu, D., 2021. ATLAS: A sequence-based learning approach for
attack investigation, in: 30th USENIX Security Symposium (USENIX
Security 21).
Baig, M.M., Awais, M.M., El-Alfy, E.S.M., 2017. A multiclass cascade
of artificial neural network for network intrusion detection. Journal of
Intelligent & Fuzzy Systems 32, 2875–2883.
Bühlmann, P., 2012. Bagging, boosting and ensemble methods, in: Hand-
book of Computational Statistics. Springer, pp. 985–1022.
Chalapathy, R., Chawla, S., 2019. Deep learning for anomaly detection: A
survey. arXiv preprint arXiv:1901.03407 .
Chalapathy, R., Khoa, N.L.D., Chawla, S., 2020. Robust deep learn-
ing methods for anomaly detection, in: Proceedings of the 26th ACM
SIGKDD International Conference on Knowledge Discovery & Data
Mining (KDD’20), pp. 3507–3508.
Doan, M., Zhang, Z., 2020. Deep learning in 5G wireless networks-
anomaly detections, in: 29th Wireless and Optical Communications
Conference (WOCC’20), IEEE. pp. 1–6.
Dong, H., Li, T., Ding, R., Sun, J., 2018. A novel hybrid genetic algo-
rithm with granular information for feature selection and optimization.
Applied Soft Computing 65, 33–46.
Du, M., Li, F., Zheng, G., Srikumar, V., 2017. DeepLog: Anomaly detec-
tion and diagnosis from system logs through deep learning, in: Proceed-
ings of the 2017 ACM SIGSAC Conference on Computer and Commu-
nications Security (CCS’17), pp. 1285–1298.
Dutta, V., Choraś, M., Pawlicki, M., Kozik, R., 2020. A deep learning
ensemble for network anomaly and cyber-attack detection. Sensors 20,
4583.
Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera,
F., 2018. Learning from imbalanced data sets. volume 10. Springer.
Ghamisi, P., Benediktsson, J.A., 2014. Feature selection based on hy-
bridization of genetic algorithm and particle swarm optimization. IEEE
Geoscience and Remote Sensing Letters (GRSL) 12, 309–313.
Gomes, H.M., Barddal, J.P., Enembreck, F., Bifet, A., 2017. A survey on
ensemble learning for data stream classification. ACM Computing Sur-
veys (CSUR) 50, 1–36.
Hamamoto, A.H., Carvalho, L.F., Sampaio, L.D.H., Abrão, T., Proença Jr,
M.L., 2018. Network anomaly detection system using genetic algorithm
and fuzzy logic. Expert Systems with Applications 92, 390–402.
Hartmann, W.M., 2004. Dimension reduction vs. variable selection, in:
International Workshop on Applied Parallel Computing (PARA’04),
Springer. pp. 931–938.
Hwang, R.H., Peng, M.C., Huang, C.W., Lin, P.C., Nguyen, V.L., 2020.
An unsupervised deep learning model for early network traffic anomaly
detection. IEEE Access 8, 30387–30399.
Imamverdiyev, Y., Abdullayeva, F., 2018. Deep learning method for denial
of service attack detection based on restricted boltzmann machine. Big
data 6, 159–169.
Javaid, A., Niyaz, Q., Sun, W., Alam, M., 2016. A deep learning approach
for network intrusion detection system, in: Proceedings of the 9th EAI
International Conference on Bio-inspired Information and Communica-
tions Technologies (formerly BIONETICS), pp. 21–26.
Jia, W.J., Zhang, Y.D., 2018. Survey on theories and methods of autoen-
coder. Computer Systems & Applications 5, 1.
Kennedy, J., Eberhart, R., 1995. Particle swarm optimization, in: Proceed-
ings of 95-International Conference on Neural Networks (ICNN), IEEE.
pp. 1942–1948.
Kwon, D., Kim, H., Kim, J., Suh, S.C., Kim, I., Kim, K.J., 2019. A survey
of deep learning-based network anomaly detection. Cluster Computing
22, 949–961.
Laboratory, S.R., 2018. Malware public datasets. URL: https://mcfp.felk.
cvut.cz/publicDatasets/.
Lauzon, F.Q., 2012. An introduction to deep learning, in: 2012 11th In-
ternational Conference on Information Science, Signal Processing and
their Applications (ISSPA), IEEE. pp. 1438–1439.
Lazar, C., Taminau, J., Meganck, S., Steenhoff, D., Coletta, A., Molter,
C., de Schaetzen, V., Duque, R., Bersini, H., Nowe, A., 2012. A sur-
vey on filter techniques for feature selection in gene expression microar-
ray analysis. IEEE/ACM Transactions on Computational Biology and
Bioinformatics 9, 1106–1119.
Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., Alsaadi, F.E., 2017. A survey
of deep neural network architectures and their applications. Neurocom-
puting 234, 11–26.
Liu, Y., Wang, G., Chen, H., Dong, H., Zhu, X., Wang, S., 2011. An
improved particle swarm optimization for feature selection. Journal of
Bionic Engineering 8, 191–200.
Lopez-Martin, M., Carro, B., Sanchez-Esguevillas, A., Lloret, J., 2017.
Conditional variational autoencoder for prediction and feature recovery
applied to intrusion detection in iot. Sensors 17, 1967.
Ma, Q., Sun, C., Cui, B., Jin, X., 2021. A novel model for anomaly detection
in network traffic based on kernel support vector machine. Computers
& Security 104, 102215.
Ma, T., Wang, F., Cheng, J., Yu, Y., Chen, X., 2016. A hybrid spectral
clustering and deep neural network ensemble algorithm for intrusion de-
tection in sensor networks. Sensors 16, 1701.
Mahalakshmi, G., Uma, E., Aroosiya, M., Vinitha, M., 2021. Intrusion
detection system using convolutional neural network on UNSW-NB15 dataset,
in: Advances in Parallel Computing Technologies and Applications. IOS
Press, pp. 1–8.
Marini, F., Walczak, B., 2015. Particle swarm optimization (PSO). A tuto-
rial. Chemometrics and Intelligent Laboratory Systems 149, 153–165.
Merrill, N., Eskandarian, A., 2020. Modified autoencoder training and scor-
ing for robust unsupervised anomaly detection in deep learning. IEEE
Access 8, 101824–101833.
Min, B., Yoo, J., Kim, S., Shin, D., Shin, D., 2021. Network anomaly
detection using memory-augmented deep autoencoder. IEEE Access 9,
104695–104706.
Moustafa, N., Creech, G., Slay, J., 2017. Big data analytics for intrusion de-
tection system: Statistical decision-making using Finite Dirichlet mix-
ture models, in: Data analytics and decision support for cybersecurity:
Trends, Methodologies and Applications. Springer, pp. 127–156.
Moustafa, N., Slay, J., 2015. UNSW-NB15: a comprehensive data set for
network intrusion detection systems (UNSW-NB15 network data set),
in: 2015 Military Communications and Information Systems Confer-
ence (MilCIS), IEEE. pp. 1–6.
Moustafa, N., Slay, J., 2016a. The evaluation of network anomaly detec-
tion systems: statistical analysis of the UNSW-NB15 data set and the
comparison with the KDD99 data set. Information Security Journal: A
Global Perspective 25, 18–31.
Moustafa, N., Slay, J., 2016b. The evaluation of network anomaly detec-
tion systems: Statistical analysis of the UNSW-NB15 data set and the
comparison with the KDD99 data set. Information Security Journal: A
Global Perspective 25, 18–31.
Moustafa, N., Slay, J., Creech, G., 2019. Novel geometric area analysis
technique for anomaly detection using trapezoidal area estimation on
large-scale networks. IEEE Transactions on Big Data 5, 481–494.
NASA AVIRIS Sensor, 2021. Indian Pines dataset. URL:
http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_
Scenes#Indian_Pines.
Nkenyereye, L., Tama, B.A., Lim, S., 2021. A stacking-based deep neural
network approach for effective network anomaly detection. Computers,
Materials & Continua 66, 2217–2227.
Oreski, S., Oreski, G., 2014. Genetic algorithm-based heuristic for feature
selection in credit risk assessment. Expert Systems with Applications
41, 2052–2064.
Paxson, V., 1999. Bro: a system for detecting network intruders in real-
time. Computer networks 31, 2435–2463.
Rashid, M., Kamruzzaman, J., Imam, T., Wibowo, S., Gordon, S., 2022.
A tree-based stacking ensemble technique with feature selection for net-
work intrusion detection. Applied Intelligence , 1–14.
Roy, A., Singh, K.J., 2021. Multi-classification of UNSW-NB15 dataset
for network anomaly detection system, in: Proceedings of Interna-
tional Conference on Communication and Computational Technologies,
Springer. pp. 429–451.
Roy, S., Li, J., Choi, B.J., Bai, Y., 2022. A lightweight supervised intru-
sion detection mechanism for iot networks. Future Generation Computer
Systems 127, 276–285.
Sagi, O., Rokach, L., 2018. Ensemble learning: A survey. Wiley Interdis-
ciplinary Reviews: Data Mining and Knowledge Discovery 8, e1249.
Sheikhpour, R., Sarram, M.A., Gharaghani, S., Chahooki, M.A.Z., 2017.
A survey on semi-supervised feature selection methods. Pattern Recog-
nition 64, 141–158.
Shen, Y., Stringhini, G., 2019. Attack2vec: Leveraging temporal word em-
beddings to understand the evolution of cyberattacks, in: 28th USENIX
Security Symposium (USENIX Security 19), pp. 905–921.
Tama, B.A., Nkenyereye, L., Islam, S.R., Kwak, K.S., 2020. An enhanced
anomaly detection in web traffic using a stack of classifier ensemble.
IEEE Access 8, 24120–24134.
Tang, T.A., Mhamdi, L., McLernon, D., Zaidi, S.A.R., Ghogho, M., 2016.
Deep learning approach for network intrusion detection in software de-
fined networking, in: 2016 international conference on wireless net-
works and mobile communications (WINCOM), IEEE. pp. 258–263.
Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A., 2009. A detailed analy-
sis of the KDD CUP 99 data set, in: IEEE symposium on Computational
Intelligence for Security and Defense Applications (CISDA’09), IEEE.
pp. 1–6.
Team, Z., 2018. Zeek an open source network security monitoring tool.
URL: https://zeek.org/.
Xie, M., Han, S., Tian, B., Parvin, S., 2011. Anomaly detection in wireless
sensor networks: A survey. Journal of Network and computer Applica-
tions 34, 1302–1325.
Xiong, P., Cui, B., Cheng, Z., 2020. Anomaly network traffic detection
based on deep transfer learning, in: International Conference on Innova-
tive Mobile and Internet Services in Ubiquitous Computing (IMIS’20),
Springer. pp. 384–393.
Xue, B., Zhang, M., Browne, W.N., 2012. Particle swarm optimization for
feature selection in classification: A multi-objective approach. IEEE
Transactions on Cybernetics 43, 1656–1671.
Yang, Y., Zheng, K., Wu, C., Yang, Y., 2019. Improving the classifica-
tion effectiveness of intrusion detection by using improved conditional
variational autoencoder and deep neural network. Sensors 19, 2528.
Yin, C., Zhu, Y., Fei, J., He, X., 2017. A deep learning approach for intru-
sion detection using recurrent neural networks. IEEE Access 5, 21954–
21961.
Zhan, Z.H., Zhang, J., Li, Y., Chung, H.S.H., 2009. Adaptive particle
swarm optimization. IEEE Transactions on Systems, Man, and Cyber-
netics, Part B (Cybernetics) 39, 1362–1381.