Chameleon: Optimized Feature Selection using Particle Swarm Optimization and Ensemble Methods for Network Anomaly Detection

Aniss Chohra (a,∗), Paria Shirani (b,∗∗), ElMouatez Billah Karbab (a) and Mourad Debbabi (a)
(a) Security Research Centre, Gina Cody School of Engineering and Computer Science, Concordia University, Montréal, Québec, Canada
(b) Department of Computer Science, Ryerson University, Toronto, Ontario, Canada
ARTICLE INFO
Keywords:
Feature Selection
Swarm Intelligence
Particle Swarm Optimization (PSO)
Ensemble Methods
Internet of Things (IoT)
Network Anomaly Detection
Deep Learning
ABSTRACT
In this paper, we propose an optimization approach that leverages swarm intelligence and ensemble methods to solve the non-deterministic feature selection problem. The proposed approach is validated on two benchmark datasets, namely NSL-KDD and UNSW-NB15, in addition to a third dataset, called the IoT-Zeek dataset, which consists of Zeek network-based intrusion detection connection logs. We build the IoT-Zeek dataset by employing ensemble classification and deep learning models using publicly available malicious and benign threat intelligence on the Zeek connection logs of IoT devices. Moreover, we deploy and validate a deep learning-based anomaly detection model using autoencoders on each of the aforementioned datasets by utilizing the selected features obtained from the proposed optimization approach. The obtained results demonstrate that our approach outperforms the existing state-of-the-art machine learning models in terms of f1 score, with a 92.092% f1 score on the NSL-KDD dataset, 92.904% on the UNSW-NB15 dataset, and 97.302% on the IoT-Zeek dataset.
1. Introduction
Due to emerging technologies, the extensive connectivity between devices in different ecosystems, and the increasing rate of cyberattacks (e.g., IoT attacks increased by 700% in the last two years^1), security analysis of network data is an absolute need. However, providing accurate and efficient threat detection solutions on large volumes of data is becoming more challenging. On the other hand, during the last decade, machine learning and deep learning techniques have attracted tremendous attention in many fields (e.g., anomaly detection, vulnerability assessment, natural language processing, stock market, and weather forecasting). Therefore, training efficient and scalable machine learning and deep learning-based threat detection models has become a task of paramount importance.
There are two well-known problems that need to be addressed to provide efficient, accurate, and scalable models. (i) Selecting the appropriate hyper-parameter settings for the model to be trained: this task generally falls into the class of non-deterministic problems, as several distinct settings may give the same accuracy results; that is, the problem admits at least two optimal solutions. (ii) Selecting the set of features that best defines the final problem. Most domains have a large number of features, which makes training and validating the models time consuming. Moreover, some of those features are irrelevant due to redundancy, sparsity, or lack of correlation with the problem to be solved. Therefore, filtering out irrelevant features has become a widely adopted procedure before any model training and experimentation.
∗ Corresponding author.
∗∗ Part of this work has been done during the postdoctoral fellowship of the author at Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
Email: a_chohra@encs.concordia.ca (Aniss Chohra); paria.shirani@ryerson.ca (Paria Shirani); e_karbab@encs.concordia.ca (ElMouatez Billah Karbab); debbabi@ciise.concordia.ca (Mourad Debbabi)
ORCID(s): 0000-0003-1823-713X (Aniss Chohra); 0000-0001-5592-1518 (Paria Shirani); 0000-0003-1293-8314 (ElMouatez Billah Karbab); 0000-0003-3015-3043 (Mourad Debbabi)
^1 https://www.darkreading.com/endpoint/iot-specific-malware-infections-jumped-700-amid-pandemic
There is a palette of techniques that have been proposed
to select the most relevant features. The most common
approach is the use of ensemble methods (e.g., Sagi and
Rokach (2018); Sheikhpour et al. (2017); Lazar et al. (2012))
due to the fact that these methods provide easier explana-
tion of the variables compared to other techniques. How-
ever, sometimes it becomes difficult to know which features
are given more importance than others by the model Gomes
et al. (2017). In addition, ensemble learning techniques
combine multiple models all together in order to improve the
overall predictive capability and to decrease the overfitting
as much as possible.
There exist state-of-the-art techniques that deal with the non-deterministic aspect of the feature selection problem. These works generally use optimization algorithms to find optimal solutions and make decisions according to a certain objective function. For instance, Ahmad et al. (2018) propose a feature selection approach using artificial bee colony (ABC), and Dong et al. (2018) incorporate a hybrid genetic algorithm with granular information. However, these approaches do not explore the usage of their solutions on other types of datasets (e.g., intrusion detection systems (IDS)).
Aniss Chohra et al.: Preprint submitted to Elsevier
On the other hand, autoencoders Liu et al. (2017); Jia and Zhang (2018) are a type of neural network that aims to reconstruct a given input as its output with the least possible change. Autoencoders are widely used for anomaly detection Chalapathy and Chawla (2019); Ahmed et al. (2016); Kwon et al. (2019); Xie et al. (2011) due to their ability to compress the data into a latent representation (bottleneck), which is a reduced representation of the input. In addition, their ability to reconstruct the input makes them well suited to the anomaly detection task: anomalies can be detected by comparing the reconstructed output with the input and flagging the input as an anomaly if the reconstruction deviates from it.
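The reconstruction-based detection rule described above can be illustrated with a minimal sketch. It is not the model used in this paper: the fixed linear `encode`/`decode` pair is a hypothetical stand-in for a trained autoencoder (a 2-D input squeezed through a 1-D bottleneck), and the threshold rule (mean plus `k` standard deviations of the benign reconstruction errors) is an assumption chosen only for illustration.

```python
import statistics

def encode(point):
    """Compress a 2-D point into a 1-D latent value (the bottleneck)."""
    x, y = point
    return (x + y) / 2.0

def decode(latent):
    """Expand the latent value back into a 2-D point."""
    return (latent, latent)

def reconstruction_error(point):
    """Mean squared error between a point and its reconstruction."""
    rebuilt = decode(encode(point))
    return sum((a - b) ** 2 for a, b in zip(point, rebuilt)) / len(point)

def fit_threshold(benign_points, k=3.0):
    """Assumed rule: mean + k * standard deviation of the benign errors."""
    errors = [reconstruction_error(p) for p in benign_points]
    return statistics.mean(errors) + k * statistics.pstdev(errors)

def is_anomaly(point, threshold):
    """Flag a point whose reconstruction error exceeds the threshold."""
    return reconstruction_error(point) > threshold

# Toy benign data: both coordinates are roughly equal.
benign = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.05), (4.0, 4.1), (5.0, 4.95)]
threshold = fit_threshold(benign)

# The first point follows the benign pattern, the second one does not.
flags = [is_anomaly(p, threshold) for p in [(2.5, 2.6), (1.0, 6.0)]]
```

In this toy setting, a point is reconstructed well only when it resembles the benign data the stand-in model was designed around, which is the property a trained autoencoder is expected to learn from the data itself.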
Moreover, there exist various works that detect anomalies using deep learning models, e.g., Du et al. (2017); Merrill and Eskandarian (2020); Hwang et al. (2020); Dutta et al. (2020); Xiong et al. (2020); Doan and Zhang (2020); Chalapathy et al. (2020). However, to the best of our knowledge, none of them explores the effects of feature selection before applying the anomaly detection model.
Problem Statement: Using all the features present in the input data to train autoencoders can be troublesome and time consuming, especially when the input data contains millions of records, making the experimentation and model engineering more complicated and difficult. Moreover, autoencoders focus mainly on feature engineering and extraction rather than feature selection. In other words, by transforming the input data into a compressed representation, autoencoders are able to reduce the dimensionality of the data and learn a smaller representation. In the case of a large number of features, explaining and understanding the compressed data is difficult, whilst feature selection identifies the most useful and relevant features that best describe and define a given ground truth variable Hartmann (2004).
Generally, during the feature selection process, choos-
ing the appropriate set of hyper-parameters is quite chal-
lenging. This is due to the fact that it is a non-deterministic
problem, which can have multiple optimal solutions; all of
them would give the same performance and accuracy results.
Thus, even after exhaustive experimentation, there is no ev-
idence to prove that (i) all the possible solutions have been
explored, and (ii) a particular solution is the best one.
Key Idea: In this context, we propose a novel approach, called CHAMELEON, which combines swarm intelligence and ensemble learning techniques to select the optimal settings (hyper-parameters for the ensemble models as well as the most relevant features for each individual dataset) for the feature selection task. The proposed approach uses ensemble learning classifiers as the fitness and evaluation function for each individual/particle within the population/swarm. This population aims to converge to the optimal solutions in an iterative process, where in each iteration, each individual tries to move closer to the optimal solutions. Afterwards, we use the selected features given by the optimal ensemble model to construct an anomaly detection autoencoder; we iteratively improve the model until it outperforms the state-of-the-art models.
Contributions: The main contributions of our work are summarized as follows:
• Novel feature selection for network datasets: We propose a feature selection approach for network datasets that leverages both swarm intelligence and ensemble methods to select the most relevant features. The ensemble methods are used as a fitness function within the optimization approach in order to leverage their ability to better interpret and select the independent features.
• Training time improvement: We employ the selected features obtained from the optimization step and deploy deep learning models for network anomaly detection. The feature selection process considerably improves the training time compared to the case where all features are used.
• Malicious and benign dataset generation: We set up an environment and generate a dataset, called the IoT-Zeek dataset, from PCAPs and connection logs using Zeek NIDS Team (2018). Then, we introduce an ensemble model leveraging classical machine learning and deep learning classifiers in order to learn malicious behaviour in the generated network traffic and classify network logs into malicious or benign connections.
• Evaluation: We evaluate our proposed approach on different datasets (i.e., an IoT dataset: IoT-Zeek, and non-IoT datasets: NSL-KDD and UNSW-NB15) and demonstrate its efficiency and performance. In addition, experiments performed on the selected features obtained from the optimal solution for each dataset indicate that our proposed model outperforms existing works.
This paper is organized as follows. The most rele-
vant state-of-the-art works are discussed in section 2. An
overview of the proposed approach along with the method-
ologies are presented in section 3. The evaluation results are
provided in section 4. The limitations of our approach along
with the concluding remarks are presented in section 5.
2. Related Work
In this section, we present the most relevant and impor-
tant works that have been proposed for: (i) feature selec-
tion using optimization algorithms and (ii) anomaly detec-
tion and maliciousness fingerprinting using machine learn-
ing and deep learning models.
2.1. Feature Selection Using Optimization
Ahmad et al. (2018) propose a feature selection approach using Artificial Bee Colony (ABC). In addition, they integrate a Kalman filter^2 alongside the Hadoop ecosystem^3 for noise removal. The system is validated on ten datasets and compared with swarm intelligence approaches. However, the authors have not applied their approach to IDS datasets.
Dong et al. (2018) propose a technique for feature selection, which incorporates a hybrid genetic algorithm with granular information. This technique is tested on 11 benchmark financial datasets and compared with certain state-of-the-art techniques. The obtained results demonstrate that it achieves high classification accuracy. However, their work does not explore the usage of the approach on other types of datasets (e.g., IDS and network datasets).
^2 http://web.mit.edu/kirtley/kirtley/binlustuff/literature/control/Kalman%20filter.pdf
^3 https://hadoop.apache.org/
In Xue et al. (2012), a novel feature selection approach is proposed for classification, where the feature selection task is considered as a non-deterministic problem. The authors investigated two types of multi-objective particle swarm optimization (PSO) algorithms. The first one leverages the concept of non-dominated sorting in the feature selection problem, whilst the second one introduces more evolutionary concepts (mutation and crossover) to search for better optimal solutions. These two algorithms were compared with two standard feature selection techniques and validated on twelve benchmark datasets. However, the authors did not explore the usage of more complex fitness functions.
A novel approach for feature selection is proposed in Liu et al. (2011), which combines multi-swarm particle swarm optimization (MSPSO) with support vector machines (SVM) as the fitness function and the f1 score as the fitness value. The goal was to perform kernel optimization and feature selection simultaneously in order to achieve better generalization. The proposed approach was compared with state-of-the-art feature selection algorithms using PSO, genetic algorithm (GA), and grid search, using ten UCI (UC Irvine)^4 machine learning benchmark datasets for validation. The evaluation results show that the technique outperforms the three aforementioned techniques. However, the proposed algorithm is specific to the datasets used for validation and has not been tested on network IDS datasets.
In Ghamisi and Benediktsson (2014), the authors proposed a feature selection approach that combines GA and PSO algorithms, where SVMs are used as the fitness function and accuracy as the fitness value. The proposed technique was validated on the Indian Pines Spectral dataset NASA AVIRIS Sensor (2021), and the results show that the approach can select the most relevant features that allow higher classification accuracy. However, the authors presented neither an exhaustive study on benchmark datasets nor a comparative study with state-of-the-art techniques. Moreover, the proposed solution is limited to the utilized dataset.
A novel approach for feature selection that combines a genetic algorithm with neural networks (HGA-NN) is introduced in Oreski and Oreski (2014). The approach was applied to a real-world credit dataset collected from the Croatian Bank, and further evaluated on a benchmark credit dataset selected from the UCI database. Finally, this technique was compared to existing classification works in terms of accuracy and was shown to outperform them. However, we find that this technique focuses on accuracy rather than f1 score, and has only been applied to UCI datasets.
^4 https://archive.ics.uci.edu/ml/datasets.php
2.2. Deep Learning and Anomaly Detection
In Tama et al. (2020), the authors present a novel anomaly detection system for web applications by proposing a stacked ensemble that combines other ensemble models (e.g., random forests, XGBoost). Four datasets (CSIC-2010v2, CICIDS-2017, NSL-KDD, UNSW-NB15) were used for the validation of their approach. The obtained results show that the proposed stacked model outperforms existing web attack detection solutions in terms of accuracy and false positive rate (FPR). However, the authors have not performed a scalability and complexity study of their approach, especially for the two large datasets (UNSW-NB15 and CICIDS-2017). Nkenyereye et al. (2021) also proposed a stacking-based model for anomaly-based intrusion detection systems, where the base learners/models are deep neural networks (DNN). Their approach is validated on benchmark datasets (NSL-KDD, UNSW-NB15, and CICIDS-2017) and evaluated using several metrics, including accuracy, false positive rate, and the Matthews correlation coefficient. The obtained results show that their model outperforms a simple DNN-based anomaly model in addition to some state-of-the-art techniques (achieving 89.97%, 92/83%, and 99.65% on the three aforementioned benchmark datasets, respectively). However, they have not performed a scalability study of their model on these datasets.
In Hamamoto et al. (2018), the authors present a novel approach for anomaly detection that combines a genetic algorithm and fuzzy logic. More specifically, the genetic algorithm is deployed in order to better represent fingerprints of network segments using network flow data. This also makes it possible to predict network traffic behaviours for specific, pre-defined time windows. Then, fuzzy logic is used to decide whether there are anomalous behaviours within those time windows. Their approach was validated and evaluated on real-world network traffic data, and it proved to be effective, achieving 96.53% accuracy and a 0.56% false positive rate.
Ma et al. (2021) proposed a novel approach for anomaly detection on network traffic data, called SVM-L, which combines SVM and Linear Discriminant Analysis (LDA). More specifically, URLs from the data are used as input and converted into vector format using natural language processing (NLP) and statistical techniques. Then, these vectors are fed to the SVM model in order to classify them as anomalous or normal. In addition, the authors used an optimization algorithm to tune the hyper-parameters of the SVM classifier. The validation results of the SVM-L model show that it achieves 99% accuracy on the tested datasets.
There exist several solutions (e.g., Alsaheel et al. (2021), Shen and Stringhini (2019)) that improve the results of maliciousness segregation using advanced machine learning and NLP techniques on log files. For instance, Shen and Stringhini (2019) propose attack2vec to detect emerging network attacks by leveraging dynamic word embedding techniques. Similar to NLP word embeddings, their approach produces a dense representation of the security events while considering the time factor. Moreover, in Alsaheel et al. (2021), the authors propose Atlas, a framework for attack investigation that leverages NLP and deep learning techniques to segregate attacks and non-attacks using logs as input. Atlas begins by processing the logs and building a causal dependency graph between the events found in the logs. This graph is augmented using NLP techniques and used to train a sequence-based model that represents the attack semantics. The produced models help the cyber analyst identify key attack steps that share similarities with previous patterns. In contrast, our proposed IoT real-world dataset generation (presented in subsection 3.4) fingerprints malicious logs from the IoT network traffic data by leveraging an ensemble model constructed using several models/classifiers (e.g., Random Forests, XGBoost, CatBoost, NN, and CNN).
In Roy and Singh (2021), the authors present a study of
different anomaly detection classifiers before and after ap-
plying feature selection. More specifically, the authors com-
pare different machine learning classifiers by training each
model twice. In the first iteration, they use all the exist-
ing features from the dataset. During the second iteration,
they first tune the classifier with several feature selection al-
gorithms; then they select the feature selection algorithm
which gives the best accuracy results, and use the selected
features with the same classifier as for the first training it-
eration. The reported results show that the J84 classifier
achieves the highest accuracy.
The authors of Mahalakshmi et al. (2021) applied a convolutional neural network (CNN) model to detect anomalies efficiently. The obtained results show that their proposed CNN model achieves an accuracy of 93.5%. However, the authors have not compared their work with any other state-of-the-art approaches.
In Min et al. (2021), the authors introduced a novel network anomaly detection technique, called memory-augmented deep auto-encoder (MemAE). An autoencoder was used to reconstruct the behavior of abnormal samples that look close to normal ones; in this way, the authors address the problem of over-generalization, which occurs with abnormal samples on autoencoders.
Roy et al. (2022) propose a lightweight intrusion detection system, called B-Stacking, based on supervised machine learning. A series of feature transformations, dimensionality reduction, and feature selection methods are applied to produce the learning features. Afterwards, the authors propose B-Stacking, a machine learning ensemble that uses K-Nearest Neighbors (KNN), Random Forest, and XGBoost to detect network anomalies. The system is claimed to be lightweight and to target IoT devices. However, the experiments have been carried out on a notebook with an Intel Core i5-9400F 2.90 GHz CPU and 8GB of RAM, and the system consumes 3.4% of the RAM and 1.5% to 2.9% of the CPU on this high-end machine, which is considered highly demanding for an IoT device. In addition, the detection run-time has not been reported.
The authors in Rashid et al. (2022) propose a stacking ensemble technique (SET) with the SelectKBest feature selection technique for network anomaly detection. First, dimensionality reduction and feature selection are applied to segregate relevant features. Next, an ensemble of Decision Trees, Random Forest, and XGBoost machine learning models is employed to detect anomalies. However, the SelectKBest technique is less adaptive to new malicious network traffic over time.
3. Materials and Methods
In this section, we first provide background on the re-
lated topics, then we present an overview of our approach.
Next, we provide details on the proposed methodologies for
feature selection and anomaly detection. Finally, we present
our approach to generate the IoT-Zeek dataset.
3.1. Background
In the following, we provide an overview on Particle
swarm optimization (PSO) and ensemble methods.
3.1.1. Particle Swarm Optimization
Particle swarm optimization (PSO) (Kennedy and Eberhart (1995)) is a stochastic, meta-heuristic optimization algorithm inspired by the social behaviour of some animals (e.g., birds and fish). In the PSO algorithm, the population of individuals is referred to as the swarm, and each individual within the swarm is referred to as a particle. These particles try to find the set of optimal solutions to a given problem by constantly updating their positions according to their own performance, which is called the cognitive aspect, and the current overall performance of the swarm, which is called the social aspect. Thus, PSO is based on two essential principles: cooperation/collaboration and competition, where the former represents the ability of one particle to communicate with other particles in order to coordinate their efforts towards the optimal solutions, whilst the latter represents one particle's desire to use its own performance and move towards the possible solution.
Moreover, each particle is defined within a search space, which represents the set of hyper-parameters to be optimized for the solution. Depending on the swarm's global solution, each particle computes the cognitive aspect and the social aspect according to Equation 1 and Equation 2, respectively, as follows:
$$\mathit{cognitive} = c_1 \times r_1 \times (\mathit{pos\_best}_i - \mathit{pos}_i) \tag{1}$$
$$\mathit{social} = c_2 \times r_2 \times (\mathit{pos\_global} - \mathit{pos}_i) \tag{2}$$
where $c_1$ and $c_2$ are called acceleration constants and define the speed at which a particle should move towards the optimal solutions ($c_1$ defines the speed at which the particle should converge to its local solution, whilst $c_2$ defines the speed of convergence of the whole swarm towards the global solution), $r_1$ and $r_2$ are two randomly generated values that control the stochastic influence of the cognitive and social components on the overall velocity of a particle, $\mathit{pos}_i$ represents the current position of particle $i$, $\mathit{pos\_best}_i$ represents the local optimal solution found by that particle so far, and $\mathit{pos\_global}$ represents the position of the global solution found by the entire swarm so far.
Afterwards, each particle updates its velocity using Equation 3, where $i$ denotes the current particle, $t$ the current iteration, $v_i(t)$ the velocity of particle $i$ at iteration $t$, and $w$ the inertia weight (importance) given to that velocity (the smaller the weight $w$, the stronger the convergence towards the global optimum). Finally, the position $p_i$ of each particle is updated with the new velocity using Equation 4.
$$v_i(t+1) = w \times v_i(t) + \mathit{social} + \mathit{cognitive} \tag{3}$$
$$p_i(t+1) = p_i(t) + v_i(t+1) \tag{4}$$
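As a minimal sketch of Equations 1 to 4, the function below applies one update to a single particle, dimension by dimension. The function name `pso_step` and the sample values are hypothetical. The position step uses the freshly updated velocity, which matches the order of the update calls in Algorithm 1 and the usual PSO formulation; `r1` and `r2` are passed in explicitly so that the arithmetic can be followed by hand, whereas in practice they would be drawn at random.

```python
def pso_step(pos, vel, pos_best, pos_global, w, c1, c2, r1, r2):
    """One PSO update for a single particle, dimension by dimension."""
    new_pos, new_vel = [], []
    for x, v, best, glob in zip(pos, vel, pos_best, pos_global):
        cognitive = c1 * r1 * (best - x)      # Equation 1: pull towards own best
        social = c2 * r2 * (glob - x)         # Equation 2: pull towards swarm best
        v_next = w * v + social + cognitive   # Equation 3: velocity update
        new_vel.append(v_next)
        new_pos.append(x + v_next)            # Equation 4: position update
    return new_pos, new_vel

# Illustrative values only; they are not taken from the experiments.
new_pos, new_vel = pso_step(
    pos=[0.0, 10.0], vel=[1.0, -1.0],
    pos_best=[2.0, 8.0], pos_global=[4.0, 6.0],
    w=0.5, c1=1.0, c2=2.0, r1=0.5, r2=0.25,
)
```

For the first dimension, the cognitive term is 1.0 and the social term is 2.0, so the new velocity is 0.5 × 1.0 + 2.0 + 1.0 = 3.5 and the particle moves from 0.0 to 3.5.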
3.1.2. Ensemble Methods
Ensemble methods are a type of machine learning model that combines multiple classifiers/predictors in order to improve the performance of the overall classification/prediction. In other words, ensemble methods combine the decisions made by multiple models using techniques such as majority voting, averaging, and weighted averaging. Moreover, these techniques provide easier interpretation of features and better predictive performance with less overfitting compared to other machine learning techniques. This family of machine learning techniques is generally classified into two major types (Bühlmann (2012)), which are presented in Figure 1, as follows:
1. Bagging, where all the predictors run in parallel and independently, and are then combined using an aggregation technique to make the final decision. An example of this type is random forests, where a sample, called a bootstrap, is selected randomly and fed to one model. Therefore, each model in the forest sees different observations, which reduces the correlation between these predictors and makes them less prone to overfitting.
2. Boosting, which deploys the paradigm of learning from the predecessor's errors/mistakes (called residuals). Therefore, these types of ensemble methods are executed in sequential order, which gives them an advantage over the first type in the form of shorter training time. An example of such techniques is gradient boosting.
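The combination rules and the bootstrap sampling mentioned above can be sketched in a few lines of plain Python. The helper names (`majority_vote`, `weighted_average`, `bootstrap_sample`) and the toy inputs are hypothetical and serve only to illustrate the ideas; they do not reproduce the ensemble models used in this work.

```python
import random
from collections import Counter

def majority_vote(predictions):
    """Combine class labels; predictions holds one list of labels per model."""
    combined = []
    for labels in zip(*predictions):          # labels of all models for one sample
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined

def weighted_average(probabilities, weights):
    """Combine scores; probabilities holds one list of scores per model."""
    total = sum(weights)
    return [sum(w * p for w, p in zip(weights, scores)) / total
            for scores in zip(*probabilities)]

def bootstrap_sample(records, rng):
    """Draw len(records) items with replacement (a bagging bootstrap)."""
    return [rng.choice(records) for _ in records]

# Three models voting on three samples.
votes = majority_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]])

# Two models scoring two samples; the first model is trusted three times more.
scores = weighted_average([[0.9, 0.2], [0.5, 0.4]], weights=[3, 1])

# One bootstrap drawn from ten record indices with a fixed seed.
sample = bootstrap_sample(list(range(10)), random.Random(0))
```

Because each bootstrap is drawn with replacement, different predictors in a bagging ensemble are trained on different subsets of the observations.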
3.2. Approach Overview
In order to identify anomalous connections, we propose a deep learning-based autoencoder for anomaly detection. The input to this model is a set of features obtained from the network traffic connection logs. We propose a hybrid model consisting of the PSO algorithm and ensemble methods to identify the most relevant set of features for any given dataset. During this process, we explore two types of fitness functions (models): the first one belongs to the bagging family of ensemble methods (random forests), and the second one belongs to the boosting family (gradient boosting).
Figure 1: Bagging vs. Boosting Ensemble Methods. (Diagram: in both the bagging and the boosting panel, Bootstrap 1 to 3 feed Predictor 1 to 3.)
Figure 2: Approach Overview. (Diagram: Input Dataset → Optimized Feature Selection with PSO: 1. Search Space Definition, 2. Fitness and Objective Functions Definition, 3. Algorithm Initialization, 4. Iterative Search, 5. Optimal Solutions Selection → Selected Features → 6. Deep Learning Anomaly Detection.)
An overview of our approach is presented in Figure 2. Feature selection can be viewed as five sequential steps: fitness and objective function definition, search-space definition, algorithm initialization, iterative search, and optimal solutions selection. The proposed approach starts by defining the search space (Step 1) for PSO (Kennedy and Eberhart (1995); Ali and Malebary (2020)) depending on the chosen fitness function (Step 2). Afterwards, it takes as input any labeled dataset and initializes a fixed-size population/swarm by generating random particles (Step 3). Given a fixed number of iterations, each particle then tries to find the optimal position of the solution by constantly updating its position within the search space according to the performance of the swarm and its own performance (Step 4). The goal of the swarm is to find the optimal model (optimal hyper-parameters) which maximizes certain performance metrics (Step 5). Finally, we consider only the set of best fitting settings (e.g., hyper-parameters), which give us the highest accuracy metrics. We then use these settings to build the final model(s) in order to extract the most relevant features. Afterwards, we leverage the selected features discovered during the optimization part and engineer an anomaly detection model using deep learning autoencoders (Step 6) Lauzon (2012). Our goal is to obtain an anomaly detection model which outperforms existing models in terms of the f1 score metric.
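To make the hand-off from the final ensemble model to the anomaly detection model concrete, the sketch below keeps only the features whose importance reaches a cut-off and reduces each record to those features. It is an illustration under stated assumptions: the helper names, the cut-off value, the importance scores, and the feature names (chosen to resemble connection-log fields) are all hypothetical and are not results from this work.

```python
def select_features(importances, min_importance):
    """Keep features whose importance reaches the cut-off, ranked best first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= min_importance]

def project(record, selected):
    """Reduce one record (a dict of feature values) to the selected features."""
    return {name: record[name] for name in selected}

# Hypothetical importance scores, as an ensemble classifier might report them.
importances = {"duration": 0.40, "orig_bytes": 0.35, "resp_bytes": 0.20, "proto": 0.05}
selected = select_features(importances, min_importance=0.10)

# Only the selected features would be passed on to the autoencoder.
record = {"duration": 1.2, "orig_bytes": 300, "resp_bytes": 1200, "proto": 6}
reduced = project(record, selected)
```

Reducing each record in this way shrinks the input layer of the downstream model, which is where the training time savings are expected to come from.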
3.3. Methodology
In this section, we provide more details on the proposed
methodology.
3.3.1. Optimized Feature Selection
Our feature selection algorithm can be performed as five
sequential steps. The algorithmic description of the opti-
mized feature selection is presented in Algorithm 1 and Al-
gorithm 2. In what follows, we explain each step in detail.
Algorithm 1 Feature Selection: Algorithmic Description
Input: D ▷ Input dataset
Output: optimal_solutions
global variables
    c1, c2 ▷ cognitive and social acceleration constants, respectively
    r1, r2 ▷ cognitive and social random factors, respectively
    w ▷ velocity's weight
    swarm_size
    maxIterations
    global_fitness ▷ global solution fitness value
    global_position ▷ global solution's position
    global_solutions_list ▷ optimal solutions positions and fitness values
end global variables
fitness_function ← DEFINE_FITNESS
bounds ← DEF_SEARCH_SPACE(fitness_function)
(swarm, swarm_size, maxIterations, c1, c2, r1, r2, w) ← ALGORITHM_INIT
for each i ∈ maxIterations do ▷ Iterative search
    for each particle_k ∈ swarm do
        particle_k_fitness ← EVALUATE_FITNESS(fitness_function, particle_k)
        if particle_k_fitness > global_fitness then
            global_fitness ← particle_k_fitness
            global_position ← particle_k_position
            global_solutions_list += global_position
            personal_best_position ← particle_k_position
        end if
        UPDATE_VELOCITY(c1, c2, r1, r2, w, global_position, particle_k_position, personal_best_position, particle_k_velocity)
        UPDATE_POSITION(particle_k_velocity, particle_k+1_position)
    end for
end for
optimal_solutions ← max(f1_score, global_solutions_list) ▷ Fitness and objective function definition, Optimal solutions selection
return optimal_solutions
Fitness and objective functions: First, we define the fitness
function used to evaluate the performance of each particle
within the swarm. We choose ensemble classifiers because
they tend to resist overfitting and achieve high accuracy.
Each particle, i.e., a candidate set of hyper-parameters, is
fed to the classifier, which is configured accordingly and
trained on the dataset. At the end of this process, the fitness
function returns the following evaluation metrics: accuracy,
recall, precision, and f1 score.
Algorithm 2 Feature Selection: Function Definitions
 1: procedure DEFINE_FITNESS ▷ Fitness and objective function definition
 2:     fitness_function ← ensemble_classifier ▷ tuned between random forests and gradient boosting
 3:     return fitness_function
 4: end procedure
 5: procedure DEF_SEARCH_SPACE(fitness_function) ▷ Search-space definition
 6:     if fitness_function == random_forest then
 7:         return bounds ← [(0.1, 0.4), (50, 1000)]
 8:     else if fitness_function == gradient_boosting then
 9:         return bounds ← [(0.1, 0.4), (50, 1000), (0.1, 0.3)]
10:     end if
11: end procedure
12: procedure ALGORITHM_INIT ▷ Algorithm initialization
13:     c1 ← [1, 2] ▷ c1 is tuned using two different values: 1 and 2
14:     c2 ← 2
15:     w ← 0.5
16:     r1 ← random, r2 ← random
17:     swarm_size ← 15
18:     swarm ← random(particles, swarm_size)
19:     maxIterations ← 30
20:     return swarm, swarm_size, maxIterations, c1, c2, r1, r2, w
21: end procedure
22: procedure EVALUATE_FITNESS(fitness_function, particle)
23:     fitness_value ← f1_score(fitness_function, particle)
24:     return fitness_value
25: end procedure
26: procedure UPDATE_VELOCITY(c1, c2, r1, r2, w, global_position,
    particle_k_position, personal_best_position, particle_k_velocity)
27:     cognitive = c1 ∗ r1 ∗ (personal_best_position − particle_k_position)
28:     social = c2 ∗ r2 ∗ (global_position − particle_k_position)
29:     particle_k+1_velocity = w ∗ particle_k_velocity + social + cognitive
30:     return particle_k+1_velocity
31: end procedure
32: procedure UPDATE_POSITION(particle_k_velocity, particle_k_position)
33:     update particle position:
34:     particle_k+1_position = particle_k_velocity + particle_k_position
35:     return particle_k+1_position
36: end procedure

Next, we define the objective function to be satisfied by
the set of candidate optimal solutions. The objective function
filters the results and keeps only those that satisfy our needs.
Since we use ensemble models as evaluation/fitness functions,
we need a metric that best describes the performance of these
models at the level of each particle. Among the above-mentioned
evaluation metrics, we choose the last one (f1 score), since it
is the harmonic mean of the precision and recall and therefore
takes both false positives and false negatives into account,
whereas the accuracy only measures the fraction of correct
predictions (true positives and true negatives). Moreover, the
f1 metric is better suited to imbalanced datasets, where the
target classes are not equally represented. Our objective is
thus to retain only the model settings that give the highest
f1 score, so we define the objective function as the
maximization of this value. This process also helps the
algorithm avoid getting trapped in a local optimum and
converge toward a global one.
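To illustrate why accuracy can be misleading on imbalanced data while the f1 score is not, consider a hypothetical test set (the counts are illustrative only and are not taken from our experiments) with 50 attack records and 950 normal records, on which a classifier obtains TP = 10, FN = 40, FP = 5, and TN = 945:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{10 + 945}{1000} = 0.955,\\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{10}{15} \approx 0.667, \qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{10}{50} = 0.200,\\
f_1 &= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN} = \frac{20}{65} \approx 0.308.
\end{align*}
```

The accuracy looks high even though four out of five attacks are missed, whereas the low f1 score exposes the problem.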
Search space: Since we use ensemble methods as the particles'
evaluation/fitness function, we must choose the appropriate
set of hyper-parameters to be fed to the models. Depending on
the type of the model, i.e., bagging vs. boosting Bühlmann
(2012), we select the hyper-parameters that matter most to the
learning process of the learner/model (e.g., the number of
trees/estimators and the respective sizes of the training and
testing splits). This set of hyper-parameters defines the
dimensions of the space in which our algorithm searches for
possible optimal solutions.
For bagging techniques, two major hyper-parameters need to be
investigated and optimized, namely, the test data size and the
number of estimators (trees). The first one defines the size of
the testing data on which the model is evaluated; the size of
the training data is deduced from it automatically. It is
generally recommended
to set the testing data size smaller than that of the training
set (between 10% and 40%), thus we set the lower bound to 10%
and the upper bound to 40%. This hyper-parameter is the first
dimension of each particle; based on it, the fitness function
decides how to split the input dataset before training the
appropriate ensemble model.
The second hyper-parameter that needs to be optimized is the
number of estimators, i.e., the number of decision trees that
make up the ensemble learning model. Normally, the larger the
number of trees, the better the overall ensemble model
performs. However, this improvement is limited: beyond some
point it stops, and the performance may start to decrease,
resulting in badly predicted samples and even overfitting. In
addition, a larger number of trees incurs a higher
computational cost, which makes experimentation more difficult
on large-scale datasets. In general practice, this
hyper-parameter is set through manual experimentation: the
number of trees starts at the smallest value and is slightly
increased at each iteration as long as the model's performance
improves over the previous results. Finding the optimal number
of estimators in this way is very time consuming, especially
for large datasets, where it can take days or even months of
experimentation. Moreover, this approach does not explore all
the possible values of the hyper-parameter; it relies mainly
on the knowledge and experience of experts. Therefore, we
integrate this parameter within the optimization algorithm as
a hyper-parameter to be optimized for the global solution. To
keep the optimization algorithm scalable, we restrict this
hyper-parameter to values between 50 (lower bound) and 1000
(upper bound).
In boosting methods, new trees are added to the model in
order to correct the samples mispredicted (residuals) by the
previous trees. This process has two effects. The first one,
which can be considered a benefit, is faster training compared
to bagging techniques. The second one, which can be considered
a disadvantage, is that the model becomes more prone to
overfitting. To overcome this problem, a learning rate is
introduced: it can be seen as a weight (percentage) that
controls and reduces the amount of correction made by the
current tree (i.e., predictor/classifier) with respect to the
previous one. As a result, the overall performance of the model
improves when the learning rate is smaller and the number of
trees is higher. Therefore, in addition to the above-mentioned
hyper-parameters, boosting ensemble methods require a third
essential hyper-parameter, the learning rate, to be optimized.
In general practice, it is recommended to set this parameter to
a value between 0.1 (lower bound) and 0.3 (upper bound).
Thus, our optimization algorithm’s search space for the
bagging methods is defined as a 2-dimensional space, where
the first dimension represents the test size, and the second
one is the number of decision trees/estimators included in the
ensemble model. For the boosting methods, our search space
is defined as a 3-dimensional space, where each particle has
three parameters: (test size, number of trees, learning rate).
Algorithm initialization: We start by initializing the
settings of our PSO algorithm. First, we define the maximum
number of iterations, which can be viewed as the number of
chances given to the swarm to find the optimal solutions. This
parameter is essential because, in optimization problems, we
only know that the problem to be solved might have multiple
optimal solutions; we do not know their exact number. On the
one hand, the ability of the optimization algorithm to find
more optimal solutions improves as the number of iterations
increases. On the other hand, if the number of iterations is
too large, the algorithm can take a tremendous amount of time.
To limit time consumption, we fix this setting to 30
iterations.
Moreover, we need to define the values of the acceleration
constants (c1 and c2) and the weight (w) Kennedy and Eberhart
(1995). For the former, it is recommended to set them such
that their product is between 0 and 4 (0 ≤ c1 × c2 ≤ 4) Marini
and Walczak (2015). We run the algorithm twice: in the first
run, we set both constants to the same value (2), whilst in the
second run, we give more importance (speed) to the global
solution by setting c2 to 2 and c1 to 1. The intuition is to
start by giving the same importance to the local and global
solutions, then increase the relative importance of the global
solution (c2), and check which setting allows us to explore
better optimal solutions.
Next, we initialize the swarm (population of particles) by
first defining a fixed number of particles (individuals),
swarm_size, that make up the swarm. For each particle, we
randomly generate its velocity and position such that they lie
within our pre-defined bounds (search space definition). We
initialize the global fitness value (f1) of the overall swarm
to 0.5.
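The initialization step can be sketched as follows. This is a minimal illustration rather than our actual implementation: the bounds and the swarm size come from Algorithm 2, while the names `BOUNDS` and `init_swarm`, the dictionary layout of a particle, and the range used for the initial velocities are assumptions made for the example.

```python
import random

# Search-space bounds from Algorithm 2: (test size, number of trees[, learning rate]).
BOUNDS = {
    "random_forest": [(0.1, 0.4), (50, 1000)],
    "gradient_boosting": [(0.1, 0.4), (50, 1000), (0.1, 0.3)],
}


def init_swarm(model_type, swarm_size=15, seed=None):
    """Return a list of particles with random positions inside the bounds."""
    rng = random.Random(seed)
    bounds = BOUNDS[model_type]
    swarm = []
    for _ in range(swarm_size):
        position = [rng.uniform(lo, hi) for lo, hi in bounds]
        # Assumption: initial velocities are drawn from [-(hi - lo), hi - lo].
        velocity = [rng.uniform(-(hi - lo), hi - lo) for lo, hi in bounds]
        swarm.append({
            "position": position,
            "velocity": velocity,
            "best_position": list(position),  # personal best starts at the initial position
        })
    return swarm


swarm = init_swarm("gradient_boosting", seed=0)
global_fitness = 0.5  # initial global f1 value, as stated in the text
```

Each particle of a boosting swarm has three coordinates (test size, number of trees, learning rate), whereas a bagging swarm would have two.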
Iterative search: During the iterative process, each particle
in the swarm is evaluated with the fitness function (ensemble
model classifier) at its own coordinates. If the f1 score of
the particle is greater than the global (swarm's) f1 score,
the global f1 score is updated to the particle's value and the
global solution's position is set to the particle's position.
Next, the particle updates its velocity and position using the
functions presented in
Algorithm 2 (line 29 and line 34, respectively).
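The two update rules of Algorithm 2 can be sketched in a few lines. The function names mirror the pseudocode, but the list-based representation and the example values are assumptions made for illustration. The sketch also assumes, as in standard PSO, that the position update uses the freshly computed velocity.

```python
def update_velocity(c1, c2, r1, r2, w, global_position, position,
                    personal_best_position, velocity):
    """Velocity update of Algorithm 2 (lines 27-29), applied per dimension."""
    new_velocity = []
    for x, v, pbest, gbest in zip(position, velocity,
                                  personal_best_position, global_position):
        cognitive = c1 * r1 * (pbest - x)   # pull toward the particle's own best
        social = c2 * r2 * (gbest - x)      # pull toward the swarm's best
        new_velocity.append(w * v + social + cognitive)
    return new_velocity


def update_position(velocity, position):
    """Position update of Algorithm 2 (line 34)."""
    return [x + v for x, v in zip(position, velocity)]


# Illustrative values; r1 and r2 are fixed here so the arithmetic is easy to follow.
position = [0.2, 300.0]          # (test size, number of trees)
velocity = [0.0, 10.0]
new_velocity = update_velocity(c1=1, c2=2, r1=0.5, r2=0.5, w=0.5,
                               global_position=[0.1, 680.0],
                               position=position,
                               personal_best_position=[0.2, 400.0],
                               velocity=velocity)
new_position = update_position(new_velocity, position)
```

With c2 larger than c1, the social term dominates, so the particle moves mostly toward the global best position.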
As for the time complexity of this process, it is of the
order of O(nm), where n represents the maximum number of
iterations (line 14 in Algorithm 1) and m represents the
population/swarm size (line 15 in Algorithm 1). Since n and m
have small values in our experiments (the maximum number of
iterations is 30 and the swarm size is 15), the search loop
itself does not incur a high time cost. As for the fitness
function (line 16 in Algorithm 1), the (training) time
complexity of the models (i.e., Random Forests or XGBoost) is
of the order of O(k·v·n·log(n)), where k is the number of
trees, v is the number of features, and n here denotes the
number of records/rows. Because evaluating the fitness of each
particle (by training Random Forests or XGBoost models) is the
bottleneck of our algorithm, we leverage the multiprocessing
paradigm and take advantage of 20 CPU cores of our setup
environment. On the other hand, since we use a server with
128 GB of RAM (presented in subsection 4.1), the space
complexity does not constitute a bottleneck in our algorithm.
Therefore, our approach is sufficiently efficient and scalable
on the studied datasets and their respective models. The
experiments reported in subsection 4.6 confirm the scalability
and efficiency of our proposed approach.
It is worth noting that a particle can end up with a
non-integer value in the second dimension (number of trees).
This is problematic because the number of decision trees
making up the ensemble model must be an integer. In that case,
we round the value to the closest integer. Furthermore, if at
any iteration a particle's new position falls outside the
search-space bounds (e.g., [0.1, 0.4] for the first dimension
and [50, 1000] for the second one), we correct each
out-of-bound value to the closest bound. For instance, if a
particle's new position is (0.5, 1200), we correct it to
(0.4, 1000).
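The rounding and bound-correction rules can be sketched as below. The helper name `repair_position` and the `integer_dims` parameter are hypothetical and only serve the illustration.

```python
def repair_position(position, bounds, integer_dims=(1,)):
    """Clamp each coordinate to its bounds and round integer-valued dimensions."""
    repaired = []
    for dim, (value, (lo, hi)) in enumerate(zip(position, bounds)):
        value = min(max(value, lo), hi)   # snap an out-of-bound value to the closest bound
        if dim in integer_dims:
            value = int(round(value))     # the number of trees must be an integer
        repaired.append(value)
    return repaired


bounds = [(0.1, 0.4), (50, 1000)]               # bagging search space
print(repair_position([0.5, 1200], bounds))     # [0.4, 1000]
print(repair_position([0.25, 314.6], bounds))   # [0.25, 315]
```

Clamping before rounding guarantees that the rounded number of trees still lies within the integer bounds.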
Optimal solutions: Finally, after the maximum number of
iterations is reached, we apply a maximization function, which
takes as input all the solutions explored during the iterative
search and returns only the ones with the highest fitness
value (f1 score). If more than one optimal solution is found
(i.e., several solutions give the same f1 score), we select
the one whose hyper-parameters induce the best efficiency
(e.g., execution time and CPU usage). Then, the appropriate
model is trained with the selected optimum's hyper-parameters,
and only the features whose importance values are higher than
the average importance of all features are selected for the
next phase (i.e., anomaly detection).
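The above-average selection rule can be sketched as follows. The importance values are hypothetical; in practice they would be read from the trained model (for instance, the `feature_importances_` attribute exposed by scikit-learn and XGBoost classifiers).

```python
def select_features(importances):
    """Keep the features whose importance exceeds the mean importance."""
    mean_importance = sum(importances.values()) / len(importances)
    return [name for name, value in importances.items() if value > mean_importance]


# Hypothetical importance values, for illustration only.
importances = {
    "src_bytes": 0.45,
    "service": 0.25,
    "flag": 0.15,
    "duration": 0.10,
    "urgent": 0.05,
}
selected = select_features(importances)  # the mean importance is 0.20
```

Here only `src_bytes` and `service` exceed the mean, so the filtered dataset would keep these two columns.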
3.3.2. Deep Learning-Based Anomaly Detection
After selecting the optimal feature selection model and
using it to select the most relevant features, we use the
dataset filtered down to the selected features to build an
efficient anomaly detection model using autoencoders. To this
end, we start from the most accurate of the existing
state-of-the-art models for that dataset; using it as a
starting point narrows our search for an appropriate model.
Then, we feed the model with only the selected features of
the dataset, which helps reduce the autoencoder model's
training time: since we do not use all the features, the
compression and bottleneck generation (encoder) as well as
the reconstruction phase (decoder) run faster than when all
the features are fed as input.
Listing 1: Malicious and Benign Traffic Logs Sources
−−−−−−−−−−−−−−−−− Malicious Traffic Logs Sources −−−−−−−−−−−−−
https://mcfp.felk.cvut.cz/publicDatasets/CTUMalwareCaptureBotnet3701/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUMalwareCaptureBotnet3711/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUMalwareCaptureBotnet3721/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUMalwareCaptureBotnet3731/bro/conn.log
−−−−−−−−−−−−−−−−− Benign Traffic Logs Sources −−−−−−−−−−−−−−−
https://mcfp.felk.cvut.cz/publicDatasets/CTUNormal25/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUNormal26/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUNormal27/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUNormal28/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUNormal29/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUNormal30/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUNormal31/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUNormal32/bro/conn.log
https://mcfp.felk.cvut.cz/publicDatasets/CTUNormal33/bro/conn.log

There are multiple hyper-parameters that we need to tune
in order to find the optimal autoencoder model: batch size,
loss function, number of layers, number of neurons, and
regularizations. Once we reach a point where our model
outperforms the state-of-the-art models (e.g., Yang et al.
(2019)), we stop the search and select that model as the
optimal one. We then use the l1_norm to compute the distance
between the input and the reconstructed data. This distance is
compared against the input labels (ground truth), and
different threshold values are tuned to select the one that
gives the highest accuracy metrics on each dataset.
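The thresholding step of the autoencoder-based detector (subsection 3.3.2) can be sketched as follows. This is an illustration under assumptions: the records, reconstructions, labels, and candidate thresholds are hypothetical, the helper names are not part of our prototype, and a record is flagged as anomalous when its L1 reconstruction error exceeds the threshold.

```python
def l1_error(original, reconstructed):
    """L1 distance between an input record and its reconstruction."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed))


def best_threshold(errors, labels, candidates):
    """Return the (threshold, accuracy) pair with the highest accuracy; label 1 = anomaly."""
    best = (None, -1.0)
    for t in candidates:
        predictions = [1 if e > t else 0 for e in errors]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if accuracy > best[1]:
            best = (t, accuracy)
    return best


# Hypothetical inputs, autoencoder reconstructions, and ground-truth labels.
inputs = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.7, 0.9]]
recons = [[0.1, 0.25], [0.4, 0.3], [0.15, 0.1], [0.3, 0.2]]
labels = [0, 1, 0, 1]

errors = [l1_error(x, r) for x, r in zip(inputs, recons)]
threshold, accuracy = best_threshold(errors, labels, candidates=[0.01, 0.1, 0.5, 1.5])
```

The same sweep could maximize the f1 score instead of the accuracy by swapping the metric computed inside the loop.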
3.4. IoT-Zeek Dataset Generation
In the following, we describe the methodology used to
generate the IoT-Zeek dataset of malicious and benign network
traffic. We first deploy a real environment consisting of
various Raspberry Pi devices that communicate with each other.
We install Zeek sensors to monitor the network traffic and
extract the connection logs (conn.log) generated by the Zeek
NIDS Paxson (1999); Team (2018). Then, we inject different
malware samples into these devices and capture the malicious
network traffic. These connection logs are then classified
using classical machine learning and deep learning models into
malicious or benign connections (as explained later). Finally,
we sample a portion of the dataset containing 150,000 records
(connections): 129,441 malicious connections and 20,559 benign
connections.
Malware and Benign Sources: To ensure the freshness of
our dataset regarding the malicious/benign IP addresses,
we collect PCAPs from both the Concordia SecLab malware
feed (in-house source) and the Stratosphere Research
Laboratory Laboratory (2018). Then, we build the global
training dataset from the labeled Zeek logs. The malicious and
benign traffic logs are retrieved from the sources presented
in Listing 1. The evaluation dataset contains 1,764,604
malicious connections and 278,998 benign connections.
3.4.1. Maliciousness Classification
In this section, we present the ensemble models employed
to classify the malicious activities in the IoT-Zeek dataset.
As depicted in Figure 3, the first set of models belongs to
classical machine learning, while the second one belongs to
deep learning. The input to the models is the full set of
connection log features presented in Table 1 (there exist some
other features, which Zeek does not extract from PCAP files in
the default setting)⁵.

Figure 3: Maliciousness Detection Pipeline.

Table 1: Zeek's Connection Log File Features Description.

| Feature | Type | Description |
|---|---|---|
| ts | Numerical | Unix Timestamp format of the connection's occurrence date. |
| id.orig_h | Categorical | Originator's IP address. |
| id.orig_p | Categorical | Originator's TCP/UDP port. |
| id.resp_h | Categorical | Responder's IP address. |
| id.resp_p | Categorical | Responder's TCP/UDP port. |
| proto | Categorical | The transport layer protocol (tcp, udp, or icmp). |
| service | Categorical | The application layer requested protocol (e.g., ssh, dns, etc.). |
| orig_ip_bytes | Numerical | Number of bytes sent from the originator to the responder; this is extracted from the packet header. |
| resp_ip_bytes | Numerical | Number of bytes sent from the responder to the originator. |
| orig_pkts | Numerical | Number of packets sent from the originator to the responder. |
| resp_pkts | Numerical | Number of packets sent from the responder to the originator. |
| conn_state | Categorical | A string giving an overview description about the state of the connection. |
| history | Categorical | A string giving more details about the state of the connection. |
As for the classical machine learning classification, we
employ RandomForest, XGBoost, LightGBM, and CatBoost
classifiers. We choose these classifiers due to their high
performance and reputation in the industry. Moreover, the
chosen classifiers have been part of many winning solutions in
machine learning competitions⁶. In addition to the classical
machine learning models, we deploy two deep learning models
for maliciousness detection: a convolutional neural network
(CNN) and a feed forward neural network (NN). More
specifically, we customize the architecture of the CNN model
to learn the maliciousness of the network traffic, as shown in
Figure 4. The details of the model are presented in Table 2.
Parameters such as Filters are obtained from experiments as a
trade-off between the size and the accuracy of the model,
while Kernel and Stride take values that are standard for CNNs
in the machine learning literature. The feed forward neural
network architecture is a typical neural network with fully
connected layers. The details of the model are listed in
Table 3.
⁵ https://docs.zeek.org/en/lts/scripts/base/protocols/conn/main.zeek.html
⁶ https://www.kaggle.com/competitions

Table 2: Dimension Convolutional Neural Network Model Details.

| Block | # | Layers | Options |
|---|---|---|---|
| Block1 | 1 | Conv | Filter=64, Kernel=(3,1), Stride=(1,1), ZeroPadding, Activation=ReLU |
| | 2 | BNorm | BatchNormalization |
| | 3 | MaxPooling | Kernel=(2,2), Stride=(2,2), Zero-Padding |
| Block2 | 4 | MaxPooling | Global Max Pooling |
| | 5 | Fully Connected | #Output=512, Activation=ReLU |
| | 6 | Fully Connected | #Output=1, Activation=Sigmoid |

Table 3: Feed Forward Neural Network Model Details.

| # | Layers | Options |
|---|---|---|
| 1 | Fully Connected | #Output=128, Activation=ReLU |
| 2 | Batch Normalization | Batch Normalization |
| 3 | Fully Connected | #Output=256, Activation=ReLU |
| 4 | Batch Normalization | Batch Normalization |
| 5 | Fully Connected | #Output=512, Activation=ReLU |
| 6 | Batch Normalization | Batch Normalization |
| 7 | Fully Connected | #Output=512, Activation=ReLU |
| 8 | Fully Connected | #Output=1, Activation=Sigmoid |

Figure 4: CNN Maliciousness Detection Model's Architecture.

Ensemble Models: Training the aforementioned machine
learning classifiers on the connection logs features produces
a set of models $M = \{cM_1, cM_2, cM_3, cM_4, dM_1, dM_2\}$,
where $cM_i$ represents a classical machine learning
model/learner (i.e., RandomForest, XGBoost, LightGBM, and
CatBoost classifiers) and $dM_i$ represents a deep learning
model/learner (i.e., CNN and NN). To perform ensemble
learning, we use the ensemble averaging technique presented
in Equation 5, as follows:

$$Y(X, \alpha) = \sum_{i=1}^{|M|} \alpha_i \times y_i(X) \qquad (5)$$
where $Y$ is the ensemble probability likelihood, $X$ is the
input feature vector, $\alpha_i$ are the weights, and $y_i$ is
the probability likelihood produced by each single model. Each
individual model/learner (the classical machine learning and
deep learning models presented in Figure 3) produces a single
probability ($y_i$), which represents the maliciousness
likelihood. These models' detection probabilities are the
input to the weighted average ensemble, which uses the
$\alpha_i$ weights to produce the ensemble prediction. In the
current setting, we choose $\alpha_i = 1$ for all the models,
which means that all the models contribute equally to the
final decision.
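The ensemble averaging step can be sketched as follows. The per-model probabilities are hypothetical, and two details are assumptions made for the example: the weighted sum is divided by the sum of the weights so that the result remains a probability, and a 0.5 cut-off is used for the final decision.

```python
def ensemble_probability(probabilities, weights=None):
    """Weighted average of the per-model maliciousness probabilities."""
    if weights is None:
        weights = [1.0] * len(probabilities)   # alpha_i = 1 for every model
    weighted_sum = sum(a * y for a, y in zip(weights, probabilities))
    # Assumption: normalize by the sum of the weights to stay within [0, 1].
    return weighted_sum / sum(weights)


# Hypothetical outputs of the six models (four classical, two deep) for one connection.
model_probs = [0.9, 0.8, 0.95, 0.85, 0.7, 0.6]
score = ensemble_probability(model_probs)
is_malicious = score >= 0.5   # assumed decision threshold
```

With equal weights the score is the plain mean of the six probabilities; unequal weights would let better-performing models contribute more.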
3.4.2. System Adaptation
Adaptation to new network threats and attacks is an
important criterion in malicious network traffic detection.
In our context, we provide this capability through the
automation of the model generation process. As shown in
Figure 5, the system leverages a feed of malicious traffic (in
the form of PCAP files) to build an updated training dataset
every epoch. The updated training dataset is representative of
the state-of-the-art network attacks and benign traffic. The
system ensures the quality of the produced model by using
validation and testing datasets, and only models that achieve
high detection performance are deployed in production.
4. Evaluation Results and Discussion
In this section, we first describe the experimental setup,
and the benchmark datasets. Then, we provide more details
on the validation of our proposed feature selection approach
on each of the chosen benchmark datasets. Next, we report
the accuracy of our anomaly detection model on different
datasets and compare our results with the state-of-the-art ap-
proaches. Finally, we provide the results of our efficiency
study.
4.1. Experimental Setup
All our experiments are conducted on a computation
server with an Intel Xeon E5-2630 2.30 GHz CPU with 24
cores and 128 GB of RAM, running CentOS Linux version 7.
Our system prototype is developed using the Python 3.6
programming language and PyTorch, leveraging the sklearn and
Scikit libraries for bagging ensemble learning techniques
(random forest classifier), and the xgboost library for
boosting ensemble techniques (gradient boosting classifier)
and other machine learning models. Multiprocessing is
deployed for fast model training by taking advantage of 20
cores of the CPU for both Random Forest and XGBoost
classifiers. We use the pandas API to load and preprocess
each dataset. We adapt the autoencoder models using the
keras API in conjunction with tensorflow.
Evaluation Metrics. To evaluate the performance of our
approach, we use the accuracy, precision, recall, and F1 score
metrics, which are defined as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

$$F_1 = 2 \times \frac{precision \times recall}{precision + recall}$$
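These definitions translate directly into code. The sketch below computes the four metrics from confusion-matrix counts; the function name and the example counts are hypothetical, and the zero-division guards are an added convention rather than part of the definitions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (not taken from our experiments).
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

For these counts the accuracy is 0.85, while the F1 score is 160/190, i.e., about 0.842.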
4.2. Benchmark Datasets Description
In this subsection, we introduce the two benchmark
datasets as follows.
NSL-KDD Dataset: The NSL-KDD network dataset
(Tavallaee et al. (2009)) was proposed to fix two main issues
(namely, redundant records and level of difficulty) of its
predecessor, the KDD'99 dataset. The updated dataset
(NSL-KDD) contains a total of 148,517 network flow records,
with 77,054 labeled as normal records and 71,463 labeled as
attacks. The dataset consists of a total of 41 features, 32 of
which are of numerical (integer or float) type and 9 of which
have categorical values.

Figure 5: Machine Learning Models Development.
UNSW-NB15 Dataset: The UNSW-NB15 dataset
(Moustafa and Slay (2015); Moustafa and Slay (2016a);
Moustafa et al. (2019); Moustafa et al. (2017)) was created
by the Cyber Range Lab of the Australian Centre for Cyber
Security (ACCS) using the IXIA PerfectStorm framework to
generate traffic that contains real normal and attack
behaviours. Tcpdump is then used to capture 100 GB of network
traffic activity. The dataset covers nine types of cyber
attacks: Fuzzers, Analysis, Backdoors, Denial of Service
(DoS), Exploits, Generic, Reconnaissance, Shellcode, and
Worms. Moreover, the generated dataset contains a total of 49
features, 42 of which are of numerical (integer or float)
type and 6 of which are of categorical type. This dataset
contains 2,218,761 normal and 321,283 attack records.
4.3. Feature Selection Results
We apply our proposed optimized feature selection solution
on the three aforementioned datasets. The results for each of
the two explored fitness functions, Random Forests (bagging)
and XGBoost (boosting), are detailed in Table 4 and Table 5,
respectively. We observe that the latter fitness function
(XGBoost) achieves the highest fitness values (f1 score) on
all three datasets. Moreover, when the c2 acceleration
constant is higher than c1 (c2 = 2 and c1 = 1), the algorithm
finds better optimal solutions for two of the datasets, while
for the NSL-KDD dataset both settings give the same f1 score.
Afterwards, for the set of hyper-parameters selected for
each dataset, we train the appropriate model (XGBoost) and
extract the list of features with their corresponding
importance values. Then, we compute the average importance and
select only the features whose importance is higher than this
average. The results of this process are presented in Table 6.
Effects of Imbalanced Data: We further examine the effects
of imbalanced data on our feature selection approach. As
presented in subsection 3.4, the IoT-Zeek dataset has fewer
benign connections than malicious connections, which may lead
machine learning algorithms to ignore the minority class.
According to the literature Fernández et al. (2018),
oversampling and undersampling techniques are recommended to
overcome this issue. To this end, we leverage the SMOTE and
RUS implementations of the imbalanced-learn Python library⁷
and apply both oversampling and undersampling techniques to
the IoT-Zeek data. According to the obtained results, the
oversampling technique slightly outperforms the undersampling
technique. Consequently, we consider the oversampled dataset
during our experiments and refer to it as the
IoT-Zeek-Oversampled dataset.
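The two resampling strategies can be illustrated with a simplified, dependency-free sketch. Note that it uses plain random oversampling (duplicating minority records) as a stand-in for SMOTE, which instead synthesizes new minority samples by interpolation. The function names and the toy data are hypothetical. In imbalanced-learn itself, the corresponding classes are `SMOTE` and `RandomUnderSampler`, both applied through `fit_resample`.

```python
import random


def _group_by_class(records, labels):
    """Group the records by their class label."""
    groups = {}
    for record, label in zip(records, labels):
        groups.setdefault(label, []).append(record)
    return groups


def random_oversample(records, labels, seed=0):
    """Duplicate random minority records until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_class(records, labels)
    target = max(len(items) for items in groups.values())
    out_records, out_labels = [], []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_records.extend(items + extra)
        out_labels.extend([label] * target)
    return out_records, out_labels


def random_undersample(records, labels, seed=0):
    """Randomly drop majority records until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_class(records, labels)
    target = min(len(items) for items in groups.values())
    out_records, out_labels = [], []
    for label, items in groups.items():
        out_records.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_records, out_labels


# Toy data: six malicious (1) and two benign (0) connections.
records = [[i] for i in range(8)]
labels = [1, 1, 1, 1, 1, 1, 0, 0]
over_records, over_labels = random_oversample(records, labels)
under_records, under_labels = random_undersample(records, labels)
```

Oversampling keeps every original record at the cost of a larger dataset, whereas undersampling discards majority-class records.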
We apply our optimized feature selection solution on the
IoT-Zeek-Oversampled dataset. The results of the two explored
fitness functions are presented in Table 7. We observe that
the XGBoost fitness function achieves the highest fitness
values (f1 score). More specifically, when the c2 acceleration
constant is equal to c1 (c2 = 2 and c1 = 2), XGBoost finds
better optimal solutions. Afterwards, for each of the selected
sets of hyper-parameters, we train the appropriate XGBoost
model and extract the list of features with their
corresponding importance values. Then, we select only the
features with an importance higher than the average
importance, as presented in Table 8.

⁷ https://imbalanced-learn.org/stable/index.html

Table 4: Feature Selection Results using Random Forests Classifier as Fitness Function (Acceleration constant c2 is fixed to 2 whilst c1 is tuned, f1 score is the fitness function).

| Dataset | c1 | Test size | #Trees | Accuracy | f1 score | Precision | Recall |
|---|---|---|---|---|---|---|---|
| NSL-KDD | 2 | 0.1 | 71 | 99.52 | 99.52 | 99.52 | 99.52 |
| | | 0.1 | 70 | 99.52 | 99.52 | 99.52 | 99.52 |
| | | 0.1 | 323 | 99.51 | 99.51 | 99.51 | 99.5 |
| | | 0.103 | 295 | 99.51 | 99.51 | 99.51 | 99.51 |
| | | 0.12 | 50 | 99.5 | 99.51 | 99.51 | 99.5 |
| | | 0.1 | 107 | 99.52 | 99.51 | 99.52 | 99.51 |
| | | 0.15 | 258 | 99.51 | 99.51 | 99.51 | 99.5 |
| | | 0.2 | 50 | 99.52 | 99.51 | 99.52 | 99.51 |
| | | 0.1 | 424 | 99.5 | 99.5 | 99.5 | 99.5 |
| | 1 | 0.1 | 50 | 99.549 | 99.549 | 99.549 | 99.54 |
| | | 0.1 | 63 | 99.51 | 99.51 | 99.51 | 99.51 |
| | | 0.1 | 51 | 99.51 | 99.51 | 99.51 | 99.51 |
| | | 0.1002 | 153 | 99.51 | 99.51 | 99.51 | 99.51 |
| | | 0.105 | 50 | 99.51 | 99.51 | 99.51 | 99.51 |
| UNSW-NB15 | 2 | 0.1 | 50 | 99.93 | 99.49 | 99.28 | 99.69 |
| | | 0.1 | 84 | 99.93 | 99.43 | 99.15 | 99.71 |
| | | 0.103 | 50 | 99.93 | 99.42 | 99.17 | 99.68 |
| | 1 | 0.1 | 1000 | 99.92 | 99.4 | 99.1 | 99.70 |
| | | 0.1005 | 1000 | 99.92 | 99.4 | 99.1 | 99.70 |
| | | 0.106 | 1000 | 99.92 | 99.4 | 99.09 | 99.70 |
| IoT-Zeek Dataset | 2 | 0.1 | 724 | 99.99 | 99.99 | 100 | 99.98 |
| | | 0.111 | 106 | 99.99 | 99.99 | 100 | 99.98 |
| | | 0.112 | 980 | 99.99 | 99.99 | 100 | 99.98 |
| | 1 | 0.214 | 80 | 99.99 | 99.99 | 100 | 99.98 |
| | | 0.103 | 111 | 99.99 | 99.99 | 100 | 99.98 |
| | | 0.295 | 50 | 99.99 | 99.99 | 100 | 99.98 |

Table 5: Feature Selection Results using XGBoost Classifier as Fitness Function (Acceleration constant c2 is fixed to 2 whilst c1 is tuned, f1 score is the fitness function).

| Dataset | c1 | Test size | #Trees | Learning rate | Accuracy | f1 score | Precision | Recall |
|---|---|---|---|---|---|---|---|---|
| NSL-KDD | 2 | 0.102 | 376 | 0.162523 | 99.75 | 99.75 | 99.75 | 99.75 |
| | | 0.13 | 327 | 0.138372 | 99.75 | 99.75 | 99.75 | 99.73 |
| | | 0.104 | 292 | 0.16305 | 99.75 | 99.75 | 99.75 | 99.75 |
| | | 0.1 | 233 | 0.1473 | 99.739 | 99.739 | 99.739 | 99.73 |
| | | 0.10558 | 241 | 0.17077 | 99.73 | 99.73 | 99.7 | 99.73 |
| | 1 | 0.106 | 680 | 0.1 | 99.75 | 99.75 | 99.75 | 99.75 |
| | | 0.105 | 681 | 0.1 | 99.75 | 99.75 | 99.75 | 99.73 |
| | | 0.106 | 686 | 0.1 | 99.75 | 99.75 | 99.75 | 99.75 |
| UNSW-NB15 | 2 | 0.1 | 903 | 0.1 | 99.97 | 99.76 | 99.71 | 99.82 |
| | | 0.1 | 824 | 0.1 | 99.97 | 99.76 | 99.70 | 99.82 |
| | | 0.1 | 827 | 0.1 | 99.97 | 99.76 | 99.71 | 99.82 |
| | | 0.16 | 903 | 0.102 | 99.90 | 99.60 | 99.70 | 99.8 |
| | | 0.1 | 899 | 0.10013 | 99.90 | 99.60 | 99.70 | 99.8 |
| | 1 | 0.1 | 1000 | 0.1 | 99.90 | 99.80 | 99.80 | 99.87 |
| | | 0.1 | 1000 | 0.137 | 99.96 | 99.70 | 99.65 | 99.77 |
| | | 0.158 | 1000 | 0.141 | 99.96 | 99.71 | 99.68 | 99.74 |
| | | 0.17 | 816 | 0.12 | 99.96 | 99.69 | 99.67 | 99.71 |
| | | 0.114 | 425 | 0.159 | 99.96 | 99.67 | 99.57 | 99.77 |
| | | 0.137 | 481 | 0.125 | 99.96 | 99.67 | 99.61 | 99.73 |
| IoT-Zeek Dataset | 2 | 0.4 | 1000 | 0.294 | 99.90 | 99.90 | 99.90 | 99.90 |
| | | 0.4 | 1000 | 0.158 | 99.90 | 99.90 | 100 | 99.90 |
| | | 0.325 | 730 | 0.3 | 99.90 | 99.90 | 100 | 99.90 |
| | | 0.1 | 411 | 0.1 | 99.90 | 99.90 | 100 | 99.98 |
| | 1 | 0.369 | 510 | 0.3 | 99.99 | 99.99 | 100 | 99.99 |
| | | 0.360 | 624 | 0.3 | 99.99 | 99.99 | 100 | 99.98 |
| | | 0.397 | 382 | 0.132 | 99.99 | 99.99 | 99.99 | 99.99 |
| | | 0.361 | 664 | 0.3 | 99.99 | 99.99 | 100 | 99.97 |
4.4. Anomaly Detection Results
In this section, we describe the architecture of our
autoencoder models for each of the utilized datasets, which
we call the NSL-KDD Model, UNSW-NB15 Model, IoT-Zeek Model,
and IoT-Zeek-Oversampled Model. Then, we present the results
of these models and compare the results obtained for each
dataset's model with the state-of-the-art approaches
presented in Yang et al. (2019).
NSL-KDD Model: After several iterations of model train-
ing, we found that the optimal anomaly detection model for
this dataset has five hidden layers: two for the encoder (128
and 64 neurons respectively), one layer for the bottleneck
(32 neurons), and two others for the decoder (64 and 128
Table 6
Selected Features on each Dataset using the Optimal Solution Hyper-parameters (acceleration constant 𝑐2 is fixed to 2 and 𝑐1 is tuned between 1 and 2).

NSL-KDD (test size: 0.106; number of trees: 680; learning rate: 0.1)
  Feature name                   Feature importance
  src_bytes                      0.298222400
  num_failed_logins              0.131071240
  service                        0.074615410
  diff_srv_rate                  0.054890107
  flag                           0.039971426
  hot                            0.037307087
  count                          0.027289085
  dst_host_srv_diff_host_rate    0.025930267
  dst_host_same_srv_rate         0.024637770

UNSW-NB15 (test size: 0.1; number of trees: 1000; learning rate: 0.1)
  Feature name                   Feature importance
  sttl                           0.087299424
  ct_state_ttl                   0.059610307
  dsport                         0.018620330
  proto                          0.007498254

IoT-Zeek Dataset (test size: 0.369; number of trees: 510; learning rate: 0.3)
  Feature name                   Feature importance
  ts                             0.672937750
  id_orig_p                      0.121201570
  history                        0.110579970
  resp_ip_bytes                  0.089164086
Table 7
Feature Selection Results on IoT-Zeek-Oversampled Dataset (𝑓1 score is the fitness function; c2 is fixed to 2).

Fitness function  c1  Test size  #Trees  Accuracy  𝑓1 score  Precision  Recall
Random Forest     2   0.4        478     99.99     99.99     99.99      99.99
Random Forest     2   0.4        518     99.99     99.99     99.99      99.99
Random Forest     2   0.4        534     99.99     99.99     99.99      99.99
Random Forest     1   0.4        50      99.99     99.99     99.99      99.99
Random Forest     1   0.4        116     99.99     99.99     100        99.99
Random Forest     1   0.4        431     99.99     99.99     99.99      99.99
XGBoost           2   0.4        50      100       100       100        100
XGBoost           2   0.3054     50      100       100       100        100
XGBoost           2   0.4        50      100       100       100        100
XGBoost           1   0.4        492     99.99     99.99     99.99      99.99
XGBoost           1   0.3963     679     99.99     99.99     99.99      99.99
XGBoost           1   0.4        739     99.99     99.99     99.99      99.99
respectively). In addition, we used two regularization techniques to better deal with overfitting, namely, dropout=0.5 and an l2 norm for kernel regularization at each layer with a value of 0.001 (as shown in the first part of Table 9). Moreover, this autoencoder is fully connected, such that all layers are of the Dense layer type, and each of these layers uses ReLU as its activation function. The optimal batch size is 32, with the test set size equal to the optimal one found by the optimization algorithm (0.106). Additionally, we tuned the model with three different loss functions: categorical crossentropy, mean squared error, and mean absolute error. The results of this model validation with different thresholds are presented in Table 10. As can be seen, the model performs best with categorical crossentropy as the loss function, with an optimal threshold equal to 0.512, achieving an average 𝑓1 score of approximately 92.09%.
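The threshold validation summarized above can be illustrated with a small sketch that sweeps candidate thresholds over per-sample reconstruction errors and keeps the one with the highest 𝑓1 score. The decision rule (a sample is flagged as anomalous when its error exceeds the threshold), the function names, and the toy errors and labels are assumptions made for illustration; they are not taken from the experiments reported in Table 10.

```python
def precision_recall_f1(errors, labels, threshold):
    """Score a threshold; a sample is predicted anomalous if its error > threshold."""
    tp = fp = fn = 0
    for error, is_anomaly in zip(errors, labels):
        predicted = error > threshold
        if predicted and is_anomaly:
            tp += 1
        elif predicted and not is_anomaly:
            fp += 1
        elif not predicted and is_anomaly:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(errors, labels, candidates):
    """Return the candidate threshold with the highest f1 score."""
    return max(candidates, key=lambda t: precision_recall_f1(errors, labels, t)[2])


# Hypothetical reconstruction errors and ground-truth labels (True = anomaly).
toy_errors = [0.10, 0.20, 0.30, 0.45, 0.60, 0.70, 0.90, 1.20]
toy_labels = [False, False, False, True, False, True, True, True]
toy_candidates = [0.25, 0.40, 0.50, 0.65, 1.00]

chosen = best_threshold(toy_errors, toy_labels, toy_candidates)
print(chosen, precision_recall_f1(toy_errors, toy_labels, chosen))
```

Lowering the threshold raises recall at the cost of precision, which is the trade-off visible across the rows of Table 10.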
UNSW-NB15 Model: After applying the same model tuning steps used for the NSL-KDD dataset, we found that the optimal model for this dataset has exactly the same regularization values at each layer (dropout=0.5 and kernel_regularizer_l2=0.001). However, there are two differences compared to the previous model on the NSL-KDD dataset. First, this model has seven hidden layers (encoder=[512,256,128], bottleneck=[64], and
Table 8
Selected Features on IoT-Zeek-Oversampled Dataset using the Optimal Solution Hyper-parameters.

Model settings: test size: 0.3905; number of trees: 50; learning rate: 0.2648
  Feature name     Feature importance
  ts               0.254218453
  resp_ip_bytes    0.161528458
  resp_pkts        0.152892584
  resp_bytes       0.084122151
  id_orig_p        0.083250296
Figure 6: Deep Learning Anomaly Detection: Train and Validation Loss.
Figure 7: Autoencoder Anomaly Detection ROC Curves.
decoder=[128,256,512]), as shown in the second part of Table 9, with the batch size set to 64. Second, this model, contrary to the previous one, performs better with the mean squared error loss function, reaching an overall average 𝑓1 score of 92.904 at an optimal threshold of 2.239 (as shown in Table 10).
IoT-Zeek Model: We achieve an overall average 𝑓1 score of 97.302 on this dataset (as shown in Table 10) using the mean squared error loss function. The autoencoder model for this dataset is the same as the one for NSL-KDD, except that we use a smaller value for the L2 kernel regularization (0.0001), as shown in the third part of Table 9.
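To make the layer layouts described above concrete, the following sketch builds the hidden layer widths reported in Table 9 by mirroring the encoder around the bottleneck, and counts the weights of the resulting fully connected stack. The helper names, the input widths used in the example, and the presence of a final Dense layer that reconstructs the input are illustrative assumptions; the encoded input dimension of each dataset is not reported here.

```python
def hidden_layers(encoder, bottleneck):
    """Mirror the encoder widths to obtain a symmetric decoder (as in Table 9)."""
    return list(encoder) + [bottleneck] + list(reversed(encoder))


def dense_parameter_count(n_inputs, hidden):
    """Count weights and biases of a fully connected stack.

    Assumption: one extra Dense output layer of width n_inputs reconstructs
    the input; only the hidden layers are listed in Table 9.
    """
    widths = [n_inputs] + list(hidden) + [n_inputs]
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))


# Hidden layer widths as reported in Table 9.
nsl_kdd_hidden = hidden_layers(encoder=[128, 64], bottleneck=32)
unsw_nb15_hidden = hidden_layers(encoder=[512, 256, 128], bottleneck=64)

# Hypothetical input widths, chosen only to make the example runnable.
nsl_kdd_params = dense_parameter_count(9, nsl_kdd_hidden)
unsw_nb15_params = dense_parameter_count(4, unsw_nb15_hidden)
print(nsl_kdd_hidden, nsl_kdd_params)
print(unsw_nb15_hidden, unsw_nb15_params)
```

Dropout and the L2 kernel regularizer do not add trainable weights, so they do not appear in this count.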
IoT-Zeek-Oversampled Model: We apply the same hyper-parameters that were tuned for the NSL-KDD Model, as shown in Table 9, and train our autoencoder model on the IoT-Zeek-Oversampled dataset using the mean squared error loss function. The ROC curve is shown in Figure 8, where the model achieves 99% area under the curve (AUC). Moreover, the results obtained for different threshold values are presented in Table 10. As seen, we achieve an 𝑓1 score of 94.300 on the oversampled data, while the 𝑓1 score obtained on the original data was 97.302. This drop of about 3 percentage points in the 𝑓1 score can be explained by the selected features and their importance before and after oversampling, as presented in Table 6 and Table 8, respectively. Since the feature selection technique applies statistical methods to find the best features, if the population (the dataset) changes (due to over- or under-sampling), the importance of the selected features will most likely be different, which will affect the overall performance.
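The change in selected features before and after oversampling can be read directly from Table 6 and Table 8. The short sketch below compares the two IoT-Zeek feature sets using the importance values from those tables; the variable names are illustrative.

```python
# Feature importances for IoT-Zeek, copied from Table 6 (original data)
# and Table 8 (oversampled data).
before = {
    "ts": 0.672937750,
    "id_orig_p": 0.121201570,
    "history": 0.110579970,
    "resp_ip_bytes": 0.089164086,
}
after = {
    "ts": 0.254218453,
    "resp_ip_bytes": 0.161528458,
    "resp_pkts": 0.152892584,
    "resp_bytes": 0.084122151,
    "id_orig_p": 0.083250296,
}

shared = sorted(set(before) & set(after))    # selected in both cases
dropped = sorted(set(before) - set(after))   # selected only before oversampling
added = sorted(set(after) - set(before))     # selected only after oversampling
shift = {name: after[name] - before[name] for name in shared}

print("shared:", shared)
print("dropped:", dropped)
print("added:", added)
print("importance shift:", shift)
```

The comparison shows that `history` is no longer selected after oversampling, `resp_pkts` and `resp_bytes` are newly selected, and the importance of `ts` decreases substantially.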
Figure 8: Autoencoder Anomaly Detection ROC Curve on IoT-
Zeek-Oversampled Dataset.
We further measure the training and validation loss for each of the aforementioned models, as presented in Figure 6, where we can see that none of the three models overfits and each model's loss becomes stable around epoch 6. Moreover, Figure 7 shows the Receiver Operating Characteristic (ROC) curve for each of these models (NSL-KDD Model, UNSW-NB15 Model, and IoT-Zeek Model), where all three models achieve more than 90% area under the curve (AUC), with the IoT-Zeek Model achieving almost 100% AUC.
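For readers less familiar with the AUC metric, the following sketch shows one common way to obtain it: sweep a threshold over the anomaly scores, collect the (false positive rate, true positive rate) points, and integrate with the trapezoidal rule. The function names and the toy scores and labels are hypothetical; the curves in Figure 7 were not necessarily produced this way.

```python
def roc_points(scores, labels):
    """ROC points from sweeping a threshold over every distinct score.

    A sample is predicted anomalous when its score is >= the threshold.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Area under a piecewise linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical anomaly scores (e.g., reconstruction errors) and labels.
toy_scores = [0.10, 0.20, 0.30, 0.45, 0.60, 0.70, 0.90, 1.20]
toy_truth = [False, False, False, True, False, True, True, True]

toy_auc = trapezoid_auc(roc_points(toy_scores, toy_truth))
print(toy_auc)
```

An AUC of 1.0 means every anomalous sample scores higher than every normal one, while 0.5 corresponds to random ranking.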
4.5. Comparative Study
We further compare the two models that we trained on the NSL-KDD and UNSW-NB15 benchmark datasets with recent prominent state-of-the-art anomaly detection models (e.g., the ones presented in Yang et al. (2019)) applied on these datasets. According to our experiments, our proposed autoencoders outperform them in terms of the 𝑓1 score metric. The results of this comparison are depicted in Figure 9a and Figure 9b, for NSL-KDD and UNSW-NB15 respectively, where for both datasets, our proposed models achieve the highest 𝑓1 score values. Moreover, we compare the performance of our proposed approach with the aforementioned selected work in terms of the accuracy metric on
Table 9
Proposed Autoencoder Architecture by Dataset (number of neurons per layer).

Dataset               Encoder (layers)        Bottleneck     Decoder (layers)        Dropout  L2-regularizer
NSL-KDD               1-2: 128, 64            layer 3: 32    4-5: 64, 128            0.5      0.001
UNSW-NB15             1-3: 512, 256, 128      layer 4: 64    5-7: 128, 256, 512      0.5      0.001
IoT-Zeek              1-2: 128, 64            layer 3: 32    4-5: 64, 128            0.5      0.0001
IoT-Zeek-Oversampled  1-2: 128, 64            layer 3: 32    4-5: 64, 128            0.5      0.001
Table 10
Chameleon Deep Learning Anomaly Detection Results.

Dataset: NSL-KDD (loss function: categorical crossentropy; training time: 6 mins, 13 sec)
  Threshold  Accuracy  Precision  Recall  𝑓1 score
  0.105      86.191    81.481     98.021  88.989
  0.087      86.072    80.618     99.439  89.045
  0.177      87.833    84.065     97.016  90.077
  2.190      89.092    90.753     90.010  90.380
  1.758      89.532    90.640     91.008  90.824
  1.018      89.607    89.965     92.005  90.974
  0.837      90.073    89.900     93.010  91.429
  0.314      90.006    87.624     96.002  91.622
  0.742      90.592    89.922     94.008  91.920
  0.512      90.711    89.351     95.005  92.092

Dataset: UNSW-NB15 (loss function: mean squared error; training time: 28 mins, 38 sec)
  Threshold  Accuracy  Precision  Recall  𝑓1 score
  1.352      84.382    79.346     98.099  87.731
  1.342      84.302    78.795     99.088  87.784
  2.365      86.058    86.124     90.010  88.024
  2.346      86.387    85.913     91.008  88.387
  2.325      86.728    85.705     92.036  88.758
  2.305      87.083    85.546     93.026  89.129
  2.286      87.456    85.406     94.031  89.511
  2.265      87.757    85.169     95.044  89.836
  2.162      87.629    83.805     97.016  89.927
  2.239      89.523    90.00      96.002  92.904

Dataset: IoT-Zeek (loss function: mean squared error; training time: 8 mins, 32 sec)
  Threshold  Accuracy  Precision  Recall  𝑓1 score
  3.101      98.158    96.344     90.004  93.066
  3.098      98.288    96.329     91.002  93.590
  3.097      98.407    96.235     92.001  94.070
  3.094      98.530    96.157     93.012  94.558
  3.092      98.651    96.080     94.010  95.034
  3.090      98.777    96.043     95.009  95.523
  3.088      98.894    95.944     96.007  95.975
  3.086      99.019    95.897     97.005  96.448
  3.084      99.134    95.789     98.003  96.884
  3.082      99.246    95.659     99.002  97.302

Dataset: IoT-Zeek-Oversampled (loss function: mean squared error)
  Threshold  Accuracy  Precision  Recall  𝑓1 score
  95.921     97.321    90.444     90.004  90.223
  90.063     97.458    90.538     91.002  90.770
  89.015     97.595    90.631     92.001  91.311
  87.177     97.734    90.724     93.012  91.854
  85.143     97.871    90.813     94.010  92.384
  3.364      97.817    86.922     99.002  92.569
  82.043     97.979    90.719     95.009  92.814
  78.225     98.099    90.694     96.00   93.275
  78.037     98.236    90.781     97.005  93.790
  76.116     98.373    90.866     98.003  94.300
both datasets. The results of this comparative study are reported in Table 11 for both the NSL-KDD and UNSW-NB15 datasets.
More specifically, we compare the accuracy results obtained during the testing (using a hold-out dataset) of our autoencoder models (NSL-KDD Model and UNSW-NB15 Model) trained on both benchmark datasets with the reported accuracy results of existing state-of-the-art models tested on the NSL-KDD dataset (e.g., Yang et al. (2019); Ma et al. (2016); Javaid et al. (2016); Tang et al. (2016); Imamverdiyev and
(a) NSL-KDD Dataset. (b) UNSW-NB15 Dataset.
Figure 9: A Comparative Study of Anomaly Detection Approaches in terms of 𝑓1score.
Table 11: A Comparative Study of Anomaly Detection Approaches in terms of Accuracy.

Approach                                        NSL-KDD Dataset  UNSW-NB15 Dataset  Average Accuracy
Chameleon                                       90.71%           89.52%             90.115%
Rashid et al. (2022)                            99.90%           94.00%             96.95%
MemAE Min et al. (2021)                         89.51%           85.30%             87.405%
Roy et al. (2022)                               98.50%           Not Reported       -
CNN Mahalakshmi et al. (2021)                   Not Reported     93.50%             -
J48 Roy and Singh (2021)                        Not Reported     87.65%             -
ICVAE-DNN Yang et al. (2019)                    85.97%           89.08%             87.525%
GB-RBM Imamverdiyev and Abdullayeva (2018)      73.23%           Not Reported       -
RNN-IDS Yin et al. (2017)                       81.29%           Not Reported       -
ID-CVAE Lopez-Martin et al. (2017)              80.10%           Not Reported       -
CASCADE-ANN Baig et al. (2017)                  Not Reported     86.40%             -
DNN Tang et al. (2016)                          75.75%           Not Reported       -
STL Javaid et al. (2016)                        74.38%           Not Reported       -
SCDNN Ma et al. (2016)                          72.64%           Not Reported       -
DT Moustafa and Slay (2016b)                    Not Reported     85.56%             -
EM Clustering Moustafa and Slay (2016b)         Not Reported     78.47%             -
Abdullayeva (2018); Yin et al. (2017); Lopez-Martin et al. (2017); Min et al. (2021)) and on the UNSW-NB15 dataset (e.g., Yang et al. (2019); Baig et al. (2017); Moustafa and Slay (2016b); Roy and Singh (2021); Mahalakshmi et al. (2021); Min et al. (2021)), as presented in Table 11.
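The "Average Accuracy" column of Table 11 is the arithmetic mean of the two per-dataset accuracies and is only defined when both are reported. The sketch below reproduces that column for a few rows of the table; the accuracy values are taken from Table 11, while the dictionary layout and function name are illustrative.

```python
# (NSL-KDD accuracy, UNSW-NB15 accuracy) in percent; None means "Not Reported".
reported_accuracy = {
    "Chameleon": (90.71, 89.52),
    "Rashid et al. (2022)": (99.90, 94.00),
    "MemAE Min et al. (2021)": (89.51, 85.30),
    "ICVAE-DNN Yang et al. (2019)": (85.97, 89.08),
    "Roy et al. (2022)": (98.50, None),
}


def average_accuracy(nsl_kdd, unsw_nb15):
    """Mean of the two accuracies, or None if either one is not reported."""
    if nsl_kdd is None or unsw_nb15 is None:
        return None
    return (nsl_kdd + unsw_nb15) / 2


averages = {name: average_accuracy(*pair) for name, pair in reported_accuracy.items()}
for name, value in averages.items():
    print(name, "-" if value is None else f"{value:.3f}%")
```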
In Roy and Singh (2021), the authors examine different anomaly detection classifiers on the UNSW-NB15 dataset before and after applying feature selection. The reported results show that the J48 classifier achieves the highest accuracy of 87.65%, slightly outperforming the case where no feature selection is applied (with an accuracy of 87.44%). However, the authors have not measured other performance evaluation metrics (e.g., 𝑓1 score, recall, and precision). The authors of Mahalakshmi et al. (2021) apply a CNN model on the UNSW-NB15 dataset and detect anomalies with an accuracy of 93.5%. However, they have not examined their approach on the NSL-KDD dataset. Moreover, they have not reported additional performance metrics (e.g., 𝑓1 score, recall, and precision) in their evaluation.
In Min et al. (2021), the authors introduced MemAE by
using autoencoders to reconstruct the behavior of abnormal
samples that look close to normal ones. They achieve an
accuracy of 89.51% and f1-score of 89.93% on NSL-KDD
dataset, as well as 85.3% accuracy and 85.26% f1-score on
UNSW-NB15 dataset. However, the authors have not con-
sidered using feature selection prior to their autoencoder
anomaly detection model to show the difference between the
two scenarios.
Roy et al. (2022) propose B-Stacking, a lightweight supervised intrusion detection approach based on a machine learning ensemble that uses K-Nearest Neighbors (KNN), Random Forest, and XGBoost to detect network anomalies. The approach has been evaluated only on the NSL-KDD dataset, with an accuracy of 98.50% and an 𝑓1 score of 99.00%, and has not been tested on the UNSW-NB15 dataset. The authors in Rashid et al. (2022) propose a stacking ensemble technique (SET) with the SelectKBest feature selection technique and an ensemble of Decision Trees, Random Forest, and XGBoost machine learning models for network anomaly detection. The performed experiments demonstrate that SET obtains 94.00% for both accuracy and 𝑓1 score on the UNSW-NB15 dataset, and 99.90% for both accuracy and 𝑓1 score on the NSL-KDD dataset. However, the SelectKBest technique is less adaptive to new malicious network traffic over time. In contrast, our proposed solution, CHAMELEON, employs an autoencoder model, which is more resilient to new threats since it uses unsupervised techniques. Moreover, CHAMELEON has two sub-detection modules: supervised (XGBoost and Random Forest for classification) and unsupervised (deep autoencoders for anomaly detection), both of which achieve promising accuracy results. Although our main objective is to perform anomaly detection, the classification results presented in Table 5 and Table 7 demonstrate high 𝑓1 scores, which outperform
the reviewed existing approaches. The results reported in Table 10, obtained from our autoencoder-based anomaly detection approach, are considered high in the context of unsupervised anomaly detection.
Amongst the earlier works (published between 2016 and 2019), Yang et al. (2019) deploy a combination of variational autoencoders and deep neural networks (DNN) to detect anomalies, which achieves the highest accuracy and 𝑓1 score: 85.97% and 86.27% on NSL-KDD, and 89.08% and 90.61% on UNSW-NB15. Ma et al. (2016) combine spectral clustering and DNN, achieving 72.64% accuracy on NSL-KDD, and Javaid et al. (2016) deploy self-taught learning, reporting an accuracy of 74.38% on NSL-KDD. Tang et al. (2016) employ DNN and obtain 75.75% accuracy on NSL-KDD. Imamverdiyev and Abdullayeva (2018) deploy a Gaussian-Bernoulli Restricted Boltzmann Machine, achieving 73.23% accuracy on NSL-KDD, while Yin et al. (2017) propose a novel IDS using recurrent neural networks (RNNs), reporting 81.29% accuracy on NSL-KDD. Lopez-Martin et al. (2017) propose an intrusion detection system based on conditional variational autoencoders (CVAE), achieving 80.1% accuracy on the NSL-KDD dataset. Baig et al. (2017) introduce a novel approach for intrusion detection using multi-cascading artificial neural networks, achieving an accuracy of 86.4% on the UNSW-NB15 dataset. Moustafa and Slay (2016b) deploy two approaches: the first uses an expectation-maximization clustering technique to detect anomalies, achieving 78.47% accuracy on the UNSW-NB15 dataset, and the second deploys decision trees on the same dataset and records an accuracy of 85.56%. Given all this information, we notice that our work outperforms these earlier state-of-the-art works by achieving 90.711% accuracy and a 92.092% 𝑓1 score on the NSL-KDD dataset, and 89.523% accuracy and a 92.904% 𝑓1 score on the UNSW-NB15 dataset.
The advantages of our approach over the aforementioned existing works are as follows:
• Feature selection: our work is amongst the few proposed approaches (e.g., Roy and Singh (2021)) that select the most important features, here through the PSO algorithm, before applying a detection model. Feature selection filters irrelevant features (noisy data) out of the dataset, which helps classify each class/label more accurately and results in more accurate models. In addition, feature selection provides better efficiency and scalability compared to existing models that use all the features of the datasets.
• Evaluation on a recent real-world IoT dataset: while existing works evaluated their approaches on the most common benchmark datasets (NSL-KDD and UNSW-NB15), none of them conducted experiments on a real-world IoT dataset. In contrast, we first generate our own real-world IoT dataset, and then apply our models to that dataset in addition to the non-IoT datasets. This makes our approach more realistic and applicable to recent security problems.
4.6. Efficiency
In this section, we examine the execution time of the optimized feature selection algorithm depending on the chosen fitness function. The obtained results are presented in Figure 10. The execution time is relatively high for the UNSW-NB15 dataset, due to the huge number of records as well as the large number of features. However, we do not consider this an issue, since the optimized feature selection task is executed only once on each dataset.
Figure 10: Optimized feature selection execution times.
5. Concluding Remarks and Limitations
Optimization of non-deterministic tasks in machine learning and deep learning is becoming a widespread approach to help developers find optimal hyper-parameter settings and use them to build their classification, regression, or clustering models. This paper presented a novel approach that focuses on finding the optimal hyper-parameters for ensemble methods in order to select the important features of a given networking dataset. The proposed approach is developed by combining ensemble methods with a swarm intelligence optimization algorithm (PSO). Our validation results show that the proposed algorithm finds better solutions when tuned with boosting (XGBoost) ensemble techniques rather than bagging (Random Forest) ones.
Moreover, we used the optimal solutions found by the optimization algorithm to select the appropriate set of features on each validation dataset. Using only those features, we built and tuned an anomaly detection autoencoder for each of these datasets. The obtained evaluation results demonstrate that our anomaly detection models outperform the most efficient state-of-the-art techniques applied on these datasets. Additionally, they achieve reasonable training times.
However, there are some limitations to our work that need to be addressed in the future. The first is that we used only two hyper-parameters for the optimization algorithm when using Random Forests (number of trees and test size), and three when using it with XGBoost
(number of trees, test size, and learning rate). We are currently exploring the possibility of optimizing more hyper-parameters. In addition, we need to improve the scalability (execution times) of the feature selection (optimization) algorithm, although this does not pose a major issue, since the algorithm needs to be run only once for each dataset and not on a regular basis. Furthermore, we have not explored setting the PSO hyper-parameters (𝑐1, 𝑐2, and 𝑤) in an adaptive fashion, which can also improve the search efficiency; this involves using variations of PSO, such as Adaptive Particle Swarm Optimization (APSO) Zhan et al. (2009), in order to find the optimal settings for these three hyper-parameters.
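To show where the three PSO hyper-parameters enter the search, the sketch below implements the widely used inertia-weight form of the PSO update with fixed 𝑤, 𝑐1, and 𝑐2; an adaptive variant such as APSO would adjust these values during the run instead. This is a generic illustration under stated assumptions, not the optimization code used in this work: the function names, the toy fitness function, the clamping of positions to the bounds, and the value 𝑤 = 0.7 are hypothetical, whereas 𝑐2 = 2 and 𝑐1 in [1, 2] follow the settings reported in the tables. In this work the fitness would be the 𝑓1 score of the ensemble classifier rather than a toy function.

```python
import random


def pso_maximize(fitness, bounds, n_particles=10, n_iters=30,
                 w=0.7, c1=1.5, c2=2.0, seed=0):
    """Maximize `fitness` over a box with fixed inertia and acceleration constants."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # personal best positions
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]     # global best so far
    history = [gbest_fit]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # inertia term + cognitive (c1) pull + social (c2) pull
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
        history.append(gbest_fit)
    return gbest, gbest_fit, history


def toy_fitness(position):
    """Hypothetical smooth objective with its maximum (0.0) at (0.3, 0.6)."""
    x, y = position
    return -((x - 0.3) ** 2 + (y - 0.6) ** 2)


toy_bounds = [(0.0, 1.0), (0.0, 1.0)]
best_position, best_fitness, history = pso_maximize(toy_fitness, toy_bounds)
print(best_position, best_fitness)
```

Because the global best is only replaced by a strictly better candidate, the recorded best fitness never decreases from one iteration to the next.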
References
Ahmad, A., Khan, M., Paul, A., Din, S., Rathore, M.M., Jeon, G., Choi,
G.S., 2018. Toward modeling and optimization of features selection in
big data based social internet of things. Future Generation Computer
Systems 82, 715–726.
Ahmed, M., Mahmood, A.N., Hu, J., 2016. A survey of network anomaly
detection techniques. Journal of Network and Computer Applications
60, 19–31.
Ali, W., Malebary, S.J., 2020. Particle swarm optimization-based feature
weighting for improving intelligent phishing website detection. IEEE
Access 8, 116766–116780.
Alsaheel, A., Nan, Y., Ma, S., Yu, L., Walkup, G., Celik, Z.B., Zhang,
X., Xu, D., 2021. ATLAS: A sequence-based learning approach for
attack investigation, in: 30th USENIX Security Symposium (USENIX
Security 21).
Baig, M.M., Awais, M.M., El-Alfy, E.S.M., 2017. A multiclass cascade
of artificial neural network for network intrusion detection. Journal of
Intelligent & Fuzzy Systems 32, 2875–2883.
Bühlmann, P., 2012. Bagging, boosting and ensemble methods, in: Hand-
book of Computational Statistics. Springer, pp. 985–1022.
Chalapathy, R., Chawla, S., 2019. Deep learning for anomaly detection: A
survey. arXiv preprint arXiv:1901.03407 .
Chalapathy, R., Khoa, N.L.D., Chawla, S., 2020. Robust deep learn-
ing methods for anomaly detection, in: Proceedings of the 26th ACM
SIGKDD International Conference on Knowledge Discovery & Data
Mining (KDD’20), pp. 3507–3508.
Doan, M., Zhang, Z., 2020. Deep learning in 5G wireless networks-
anomaly detections, in: 29th Wireless and Optical Communications
Conference (WOCC’20), IEEE. pp. 1–6.
Dong, H., Li, T., Ding, R., Sun, J., 2018. A novel hybrid genetic algo-
rithm with granular information for feature selection and optimization.
Applied Soft Computing 65, 33–46.
Du, M., Li, F., Zheng, G., Srikumar, V., 2017. DeepLog: Anomaly detec-
tion and diagnosis from system logs through deep learning, in: Proceed-
ings of the 2017 ACM SIGSAC Conference on Computer and Commu-
nications Security (CCS’17), pp. 1285–1298.
Dutta, V., Choraś, M., Pawlicki, M., Kozik, R., 2020. A deep learning
ensemble for network anomaly and cyber-attack detection. Sensors 20,
4583.
Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera,
F., 2018. Learning from imbalanced data sets. volume 10. Springer.
Ghamisi, P., Benediktsson, J.A., 2014. Feature selection based on hy-
bridization of genetic algorithm and particle swarm optimization. IEEE
Geoscience and Remote Sensing Letters (GRSL) 12, 309–313.
Gomes, H.M., Barddal, J.P., Enembreck, F., Bifet, A., 2017. A survey on
ensemble learning for data stream classification. ACM Computing Sur-
veys (CSUR) 50, 1–36.
Hamamoto, A.H., Carvalho, L.F., Sampaio, L.D.H., Abrão, T., Proença Jr,
M.L., 2018. Network anomaly detection system using genetic algorithm
and fuzzy logic. Expert Systems with Applications 92, 390–402.
Hartmann, W.M., 2004. Dimension reduction vs. variable selection, in:
International Workshop on Applied Parallel Computing (PARA’04),
Springer. pp. 931–938.
Hwang, R.H., Peng, M.C., Huang, C.W., Lin, P.C., Nguyen, V.L., 2020.
An unsupervised deep learning model for early network traffic anomaly
detection. IEEE Access 8, 30387–30399.
Imamverdiyev, Y., Abdullayeva, F., 2018. Deep learning method for denial
of service attack detection based on restricted boltzmann machine. Big
data 6, 159–169.
Javaid, A., Niyaz, Q., Sun, W., Alam, M., 2016. A deep learning approach
for network intrusion detection system, in: Proceedings of the 9th EAI
International Conference on Bio-inspired Information and Communica-
tions Technologies (formerly BIONETICS), pp. 21–26.
Jia, W.J., Zhang, Y.D., 2018. Survey on theories and methods of autoen-
coder. Computer Systems & Applications 5, 1.
Kennedy, J., Eberhart, R., 1995. Particle swarm optimization, in: Proceed-
ings of 95-International Conference on Neural Networks (ICNN), IEEE.
pp. 1942–1948.
Kwon, D., Kim, H., Kim, J., Suh, S.C., Kim, I., Kim, K.J., 2019. A survey
of deep learning-based network anomaly detection. Cluster Computing
22, 949–961.
Laboratory, S.R., 2018. Malware public datasets. URL: https://mcfp.felk.cvut.cz/publicDatasets/.
Lauzon, F.Q., 2012. An introduction to deep learning, in: 2012 11th In-
ternational Conference on Information Science, Signal Processing and
their Applications (ISSPA), IEEE. pp. 1438–1439.
Lazar, C., Taminau, J., Meganck, S., Steenhoff, D., Coletta, A., Molter,
C., de Schaetzen, V., Duque, R., Bersini, H., Nowe, A., 2012. A sur-
vey on filter techniques for feature selection in gene expression microar-
ray analysis. IEEE/ACM Transactions on Computational Biology and
Bioinformatics 9, 1106–1119.
Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., Alsaadi, F.E., 2017. A survey
of deep neural network architectures and their applications. Neurocom-
puting 234, 11–26.
Liu, Y., Wang, G., Chen, H., Dong, H., Zhu, X., Wang, S., 2011. An
improved particle swarm optimization for feature selection. Journal of
Bionic Engineering 8, 191–200.
Lopez-Martin, M., Carro, B., Sanchez-Esguevillas, A., Lloret, J., 2017.
Conditional variational autoencoder for prediction and feature recovery
applied to intrusion detection in iot. Sensors 17, 1967.
Ma, Q., Sun, C., Cui, B., Jin, X., 2021. A novel model for anomaly detection
in network traffic based on kernel support vector machine. Computers
& Security 104, 102215.
Ma, T., Wang, F., Cheng, J., Yu, Y., Chen, X., 2016. A hybrid spectral
clustering and deep neural network ensemble algorithm for intrusion de-
tection in sensor networks. Sensors 16, 1701.
Mahalakshmi, G., Uma, E., Aroosiya, M., Vinitha, M., 2021. Intrusion detection system using convolutional neural network on unsw nb15 dataset, in: Advances in Parallel Computing Technologies and Applications. IOS Press, pp. 1–8.
Marini, F., Walczak, B., 2015. Particle swarm optimization (PSO). A tuto-
rial. Chemometrics and Intelligent Laboratory Systems 149, 153–165.
Merrill, N., Eskandarian, A., 2020. Modified autoencoder training and scor-
ing for robust unsupervised anomaly detection in deep learning. IEEE
Access 8, 101824–101833.
Min, B., Yoo, J., Kim, S., Shin, D., Shin, D., 2021. Network anomaly
detection using memory-augmented deep autoencoder. IEEE Access 9,
104695–104706.
Moustafa, N., Creech, G., Slay, J., 2017. Big data analytics for intrusion de-
tection system: Statistical decision-making using Finite Dirichlet mix-
ture models, in: Data analytics and decision support for cybersecurity:
Trends, Methodologies and Applications. Springer, pp. 127–156.
Moustafa, N., Slay, J., 2015. UNSW-NB15: a comprehensive data set for
network intrusion detection systems (UNSW-NB15 network data set),
in: 2015 Military Communications and Information Systems Confer-
ence (MilCIS), IEEE. pp. 1–6.
Moustafa, N., Slay, J., 2016a. The evaluation of network anomaly detec-
tion systems: statistical analysis of the UNSW-NB15 data set and the
comparison with the KDD99 data set. Information Security Journal: A
Global Perspective 25, 18–31.
Moustafa, N., Slay, J., 2016b. The evaluation of network anomaly detec-
tion systems: Statistical analysis of the UNSW-NB15 data set and the
comparison with the KDD99 data set. Information Security Journal: A
Global Perspective 25, 18–31.
Moustafa, N., Slay, J., Creech, G., 2019. Novel geometric area analysis
technique for anomaly detection using trapezoidal area estimation on
large-scale networks. IEEE Transactions on Big Data 5, 481–494.
NASA AVIRIS Sensor, 2021. Indian Pines dataset. URL: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Indian_Pines.
Nkenyereye, L., Tama, B.A., Lim, S., 2021. A stacking-based deep neu-
ral network approach for effective network anomaly detection. CMC-
COMPUTERS MATERIALS & CONTINUA 66, 2217–2227.
Oreski, S., Oreski, G., 2014. Genetic algorithm-based heuristic for feature
selection in credit risk assessment. Expert Systems with Applications
41, 2052–2064.
Paxson, V., 1999. Bro: a system for detecting network intruders in real-
time. Computer networks 31, 2435–2463.
Rashid, M., Kamruzzaman, J., Imam, T., Wibowo, S., Gordon, S., 2022.
A tree-based stacking ensemble technique with feature selection for net-
work intrusion detection. Applied Intelligence , 1–14.
Roy, A., Singh, K.J., 2021. Multi-classification of UNSW-NB15 dataset
for network anomaly detection system, in: Proceedings of Interna-
tional Conference on Communication and Computational Technologies,
Springer. pp. 429–451.
Roy, S., Li, J., Choi, B.J., Bai, Y., 2022. A lightweight supervised intru-
sion detection mechanism for iot networks. Future Generation Computer
Systems 127, 276–285.
Sagi, O., Rokach, L., 2018. Ensemble learning: A survey. Wiley Interdis-
ciplinary Reviews: Data Mining and Knowledge Discovery 8, e1249.
Sheikhpour, R., Sarram, M.A., Gharaghani, S., Chahooki, M.A.Z., 2017.
A survey on semi-supervised feature selection methods. Pattern Recog-
nition 64, 141–158.
Shen, Y., Stringhini, G., 2019. Attack2vec: Leveraging temporal word em-
beddings to understand the evolution of cyberattacks, in: 28th USENIX
Security Symposium (USENIX Security 19), pp. 905–921.
Tama, B.A., Nkenyereye, L., Islam, S.R., Kwak, K.S., 2020. An enhanced
anomaly detection in web traffic using a stack of classifier ensemble.
IEEE Access 8, 24120–24134.
Tang, T.A., Mhamdi, L., McLernon, D., Zaidi, S.A.R., Ghogho, M., 2016.
Deep learning approach for network intrusion detection in software de-
fined networking, in: 2016 international conference on wireless net-
works and mobile communications (WINCOM), IEEE. pp. 258–263.
Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A., 2009. A detailed analy-
sis of the KDD CUP 99 data set, in: IEEE symposium on Computational
Intelligence for Security and Defense Applications (CISDA’09), IEEE.
pp. 1–6.
Team, Z., 2018. Zeek an open source network security monitoring tool.
URL: https://zeek.org/.
Xie, M., Han, S., Tian, B., Parvin, S., 2011. Anomaly detection in wireless
sensor networks: A survey. Journal of Network and computer Applica-
tions 34, 1302–1325.
Xiong, P., Cui, B., Cheng, Z., 2020. Anomaly network traffic detection
based on deep transfer learning, in: International Conference on Innova-
tive Mobile and Internet Services in Ubiquitous Computing (IMIS’20),
Springer. pp. 384–393.
Xue, B., Zhang, M., Browne, W.N., 2012. Particle swarm optimization for
feature selection in classification: A multi-objective approach. IEEE
Transactions on Cybernetics 43, 1656–1671.
Yang, Y., Zheng, K., Wu, C., Yang, Y., 2019. Improving the classifica-
tion effectiveness of intrusion detection by using improved conditional
variational autoencoder and deep neural network. Sensors 19, 2528.
Yin, C., Zhu, Y., Fei, J., He, X., 2017. A deep learning approach for intru-
sion detection using recurrent neural networks. IEEE Access 5, 21954–
21961.
Zhan, Z.H., Zhang, J., Li, Y., Chung, H.S.H., 2009. Adaptive particle
swarm optimization. IEEE Transactions on Systems, Man, and Cyber-
netics, Part B (Cybernetics) 39, 1362–1381.
... For this reason, heuristic algorithms have been developed to reduce the computational load of the search process. Evolutionary computation strategies including particle swarm optimization (PSO) [24][25][26][27], genetic algorithm (GA) [28], and differential evolution (DE) [29][30][31] have proven to be promising approaches in overcoming the shortcomings of the classic methods and achieve good solutions without the need to conduct an exhaustive search [33,34]. During the PSO, a population of particles with cooperative behavior moves around the search space and seeks for promising solutions [35]. ...
Article
Full-text available
Due to the non-selective behavior of gas sensors in electronic nose (e-nose) systems, the provided signals in exposure to target analytes contain un-needed information. These are considered as noise reducing the detection accuracy. Feature selection, as a pre-processing step in data analysis, removes extra information from the sensors’ signals and provides a more relevant data matrix with lower dimensionality to enhance the system selectivity. In the high-dimensional sensor array response spaces, it is however essential to improve the conventional algorithms to be able to cope with complicated feature selection problem. In this study, in order to acquire optimal responses from the gas sensor array of an e-nose system and increase its selectivity for detection of rice type and its storage duration (freshness), the feature selection problem was formulated in an optimization framework. For this reason, a particle swarm optimization (PSO) with an automatic stagnation detecting system was enhanced by genetic operators of differential evolution. This helped acquiring more exploration ability through providing oriented jumps. It was revealed that the system’s detection accuracy was improved when smaller subset of features was utilized instead of the whole response, indicating that the sensor array signals included large amount of irrelevant information. The improved PSO could significantly present lower error values than the standard PSO and other examined conventional algorithms. It was concluded the developed algorithm has the potential to be applied as a promising feature selection algorithm in high-dimensional signals of the e-nose systems.
... The evaluation was conducted on datasets including KDDCUP99, NLS-KDD, and UNSW-NB15, showcasing outcomes that demonstrated a high detection rate and accuracy while minimizing false alarms. In addition, some studies designed lightweight models to meet the characteristic of IoT network, Liu et al. [23] proposed Particle Swam Optimization (PSO) with one-class Support Vector Machine (SVM) [24] optimized PSO for feature selection with light GBM to build lightweight models for detecting attack. However, it is worth noting that these feature selection strategies often come at a high computational cost, especially when relying on GA, PSO, or machine learning-based classifiers, as a result, which have negative impact on resource-constraint IoT system and networks. ...
Article
Full-text available
Internet of Things (IoT) devices are widely used but also vulnerable to cyberattacks that can cause security issues. To protect against this, machine learning approaches have been developed for network intrusion detection in IoT. These often use feature reduction techniques like feature selection or extraction before feeding data to models. This helps make detection efficient for real-time needs. This paper thoroughly compares feature extraction and selection for IoT network intrusion detection in machine learning-based attack classification framework. It looks at performance metrics like accuracy, f1-score, and runtime, etc. on the heterogenous IoT dataset named Network TON-IoT using binary and multiclass classification. Overall, feature extraction gives better detection performance than feature selection as the number of features is small. Moreover, extraction shows less feature reduction compared with that of selection, and is less sensitive to changes in the number of features. However, feature selection achieves less model training and inference time compared with its counterpart. Also, more space to improve the accuracy for selection than extraction when the number of features changes. This holds for both binary and multiclass classification. The study provides guidelines for selecting appropriate intrusion detection methods for particular scenarios. Before, the TON-IoT heterogeneous IoT dataset comparison and recommendations were overlooked. Overall, the research presents a thorough comparison of feature reduction techniques for machine learning-driven intrusion detection in IoT networks.
... Predictive models generate regression models based on recent trends, considering a data point anomalous if it significantly differs from predicted values [62]. Ensemble methods use multiple algorithms and a voting mechanism for anomaly detection, enhancing accuracy but increasing complexity and computational time [63]. ...
Article
Full-text available
Advanced Machine Learning (ML) algorithms can be applied using Edge Computing (EC) to detect anomalies, which is the basis of Artificial Intelligence of Things (AIoT). EC has emerged as a solution for processing and analysing information on IoT devices. This field aims to allow the implementation of Machine/Deep Learning (DL) models on MicroController Units (MCUs). Integrating anomaly detection analysis on Internet of Things (IoT) devices produces clear benefits as it ensures the use of accurate data from the initial stage. However, this process poses a challenge due to the unique characteristics of IoT. This article presents a Systematic Literature Mapping of scientific research on the application of anomaly detection techniques in EC using MCUs. A total of 18 papers published over the period 2021–2023 were selected from a total of 162 in four databases of scientific papers. The results of this paper provide a comprehensive overview of anomaly detection using TinyML and MCUs. The main contributions of this survey are the fact that it aims to: (a) study techniques for anomaly detection in ML/DL and validation metrics used in the AIoT; (b) analyse data used in the estimation of models; (c) show how ML is applied in EC using hardware or software; (d) investigate the main microcontrollers, types of power supply, and communication technology; and (e) develop a taxonomy of ML/DL algorithms used to detect anomalies in TinyML. Finally, the benefits and challenges of this kind of TinyML analysis are described.
Article
Full-text available
Due to the recent advances in the Internet and communication technologies, network systems and data have evolved rapidly. The emergence of new attacks jeopardizes network security and make it really challenging to detect intrusions. Multiple network attacks by an intruder are unavoidable. Our research targets the critical issue of class imbalance in intrusion detection, a reflection of the real-world scenario where legitimate network activities significantly out number malicious ones. This imbalance can adversely affect the learning process of predictive models, often resulting in high false-negative rates, a major concern in Intrusion Detection Systems (IDS). By focusing on datasets with this imbalance, we aim to develop and refine advanced algorithms and techniques, such as anomaly detection, cost-sensitive learning, and oversampling methods, to effectively handle such disparities. The primary goal is to create models that are highly sensitive to intrusions while minimizing false alarms, an essential aspect of effective IDS. This approach is not only practical for real-world applications but also enhances the theoretical understanding of managing class imbalance in machine learning. Our research, by addressing these significant challenges, is positioned to make substantial contributions to cybersecurity, providing valuable insights and applicable solutions in the fight against digital threats and ensuring robustness and relevance in IDS development. An intrusion detection system (IDS) checks network traffic for security, availability, and being non-shared. Despite the efforts of many researchers, contemporary IDSs still need to further improve detection accuracy, reduce false alarms, and detect new intrusions. The mean convolutional layer (MCL), feature-weighted attention (FWA) learning, a bidirectional long short-term memory (BILSTM) network, and the random forest algorithm are all parts of our unique hybrid model called MCL-FWA-BILSTM. 
The CNN-MCL layer for feature extraction receives data after preprocessing. After convolution, pooling, and flattening phases, feature vectors are obtained. The BI-LSTM and self-attention feature weights are used in the suggested method to mitigate the effects of class imbalance. The attention layer and the BI-LSTM features are concatenated to create mapped features before feeding them to the random forest algorithm for classification. Our methodology and model performance were validated using NSL-KDD and UNSW-NB-15, two widely available IDS datasets. The suggested model’s accuracies on binary and multi-class classification tasks using the NSL-KDD dataset are 99.67% and 99.88%, respectively. The model’s binary and multi-class classification accuracies on the UNSW-NB15 dataset are 99.56% and 99.45%, respectively. Further, we compared the suggested approach with other previous machine learning and deep learning models and found it to outperform them in detection rate, FPR, and F-score. For both binary and multiclass classifications, the proposed method reduces false positives while increasing the number of true positives. The model proficiently identifies diverse network intrusions on computer networks and accomplishes its intended purpose. The suggested model will be helpful in a variety of network security research fields and applications.
Article
Full-text available
Feature selection is becoming a relevant problem within the field of machine learning. The feature selection problem focuses on the selection of the small, necessary, and sufficient subset of features that represent the general set of features, eliminating redundant and irrelevant information. Given the importance of the topic, in recent years there has been a boom in the study of the problem, generating a large number of related investigations. Given this, this work analyzes 161 articles published between 2019 and 2023 (20 April 2023), emphasizing the formulation of the problem and performance measures, and proposing classifications for the objective functions and evaluation metrics. Furthermore, an in-depth description and analysis of metaheuristics, benchmark datasets, and practical real-world applications are presented. Finally, in light of recent advances, this review paper provides future research opportunities.
Chapter
Full-text available
Networks have an important role in our modern life. In the network, Cyber security plays a crucial role in Internet security. An Intrusion Detection System (IDS) acts as a cyber security system which monitors and detects any security threats for software and hardware running on the network. There we have many existing IDS but still we face challenges in improving accuracy in detecting security vulnerabilities, not enough methods to reduce the level of alertness and detecting intrusion attacks. Many researchers have tried to solve the above problems by focusing on developing IDSs by machine learning methods. Machine learning methods can detect datas from past experience and differentiate normal and abnormal data. In our work, the Convolutional Neural Network(CNN) deep learning method was developed in solving the problem of identifying intrusion in a network. Using the UNSW NB15 public dataset we trained the CNN algorithm. The Dataset contains binary types of ‘0’ and ‘1’ in general for normal and attack datas. The experimental results showed that the proposed model achieves maximum accuracy in detection and we also performed evaluation metrics to analyze the performance of the CNN algorithm.
Article
Full-text available
Several studies have used machine learning algorithms to develop intrusion systems (IDS), which differentiate anomalous behaviours from the normal activities of network systems. Due to the ease of automated data collection and subsequently an increased size of collected data on network traffic and activities, the complexity of intrusion analysis is increasing exponentially. A particular issue, due to statistical and computation limitations, a single classifier may not perform well for large scale data as existent in modern IDS contexts. Ensemble methods have been explored in literature in such big data contexts. Although more complicated and requiring additional computation, literature has a note that ensemble methods can result in better accuracy than single classifiers in different large scale data classification contexts, and it is interesting to explore how ensemble approaches can perform in IDS. In this research, we introduce a tree-based stacking ensemble technique (SET) and test the effectiveness of the proposed model on two intrusion datasets (NSL-KDD and UNSW-NB15). We further enhance incorporate feature selection techniques to select the best relevant features with the proposed SET. A comprehensive performance analysis shows that our proposed model can better identify the normal and anomaly traffic in network than other existing IDS models. This implies the potentials of our proposed system for cybersecurity in Internet of Things (IoT) and large scale networks.
Article
Full-text available
In recent years, attacks on network environments continue to rapidly advance and are increasingly intelligent. Accordingly, it is evident that there are limitations in existing signature-based intrusion detection systems. In particular, for novel attacks such as Advanced Persistent Threat (APT), signature patterns have problems with poor generalization performance. Furthermore, in a network environment, attack samples are rarely collected compared to normal samples, creating the problem of imbalanced data. Anomaly detection using an autoencoder has been widely studied in this environment, and learning is through semi-supervised learning methods to overcome these problems. This approach is based on the assumption that reconstruction errors for samples that are not used for training will be large, but an autoencoder is often over-generalized and this assumption is often broken. In this paper, we propose a network intrusion detection method using a memory-augmented deep auto-encoder (MemAE) that can solve the over-generalization problem of autoencoders. The MemAE model is trained to reconstruct the input of an abnormal sample that is close to a normal sample, which solves the generalization problem for such abnormal samples. Experiments were conducted on the NSL-KDD, UNSW-NB15, and CICIDS 2017 datasets, and it was confirmed that the proposed method is better than other one-class models.
Article
Full-text available
An anomaly-based intrusion detection system (A-IDS) provides a critical aspect in a modern computing infrastructure since new types of attacks can be discovered. It prevalently utilizes several machine learning algorithms (ML) for detecting and classifying network traffic. To date, lots of algorithms have been proposed to improve the detection performance of AIDS , either using individual or ensemble learners. In particular, ensemble learners have shown remarkable performance over individual learners in many applications, including in cybersecurity domain. However, most existing works still suffer from unsatisfactory results due to improper ensemble design. The aim of this study is to emphasize the effectiveness of stacking ensemble-based model for AIDS , where deep learning (e.g., deep neural network [DNN]) is used as base learner model. The effectiveness of the proposed model and base DNN model are benchmarked empirically in terms of several performance metrics, i.e., Matthew's correlation coefficient, accuracy, and false alarm rate. The results indicate that the proposed model is superior to the base DNN model as well as other existing ML algorithms found in the literature.
Article
Full-text available
Currently, expert systems and applied machine learning algorithms are widely used to automate network intrusion detection. In critical infrastructure applications of communication technologies, the interaction among various industrial control systems and the Internet environment intrinsic to the IoT technology makes them susceptible to cyber-attacks. Given the existence of the enormous network traffic in critical Cyber-Physical Systems (CPSs), traditional methods of machine learning implemented in network anomaly detection are inefficient. Therefore, recently developed machine learning techniques, with the emphasis on deep learning, are finding their successful implementations in the detection and classification of anomalies at both the network and host levels. This paper presents an ensemble method that leverages deep models such as the Deep Neural Network (DNN) and Long Short-Term Memory (LSTM) and a meta-classifier (i.e., logistic regression) following the principle of stacked generalization. To enhance the capabilities of the proposed approach, the method utilizes a two-step process for the apprehension of network anomalies. In the first stage, data pre-processing, a Deep Sparse AutoEncoder (DSAE) is employed for the feature engineering problem. In the second phase, a stacking ensemble learning approach is utilized for classification. The efficiency of the method disclosed in this work is tested on heterogeneous datasets, including data gathered in the IoT environment, namely IoT-23, LITNET-2020, and NetML-2020. The results of the evaluation of the proposed approach are discussed. Statistical significance is tested and compared to the state-of-the-art approaches in network anomaly detection.
Article
Full-text available
Over the last few years, web phishing attacks have been constantly evolving causing customers to lose trust in e-commerce and online services. Various tools and systems based on a blacklist of phishing websites are applied to detect the phishing websites. Unfortunately, the fast evolution of technology has led to the born of more sophisticated methods when building websites to attract users. Thus, the latest and newly deployed phishing websites; for example, zero-day phishing websites, cannot be detected by using these blacklist-based approaches. Several recent research studies have been adopting machine learning techniques to identify phishing websites and utilizing them as an early alarm method to identify such threats. However, the important website features have been selected based on human experience or frequency analysis of website features in most of these approaches. In this paper, intelligent phishing website detection using particle swarm optimization-based feature weighting is proposed to enhance the detection of phishing websites. The proposed approach suggests utilizing particle swarm optimization (PSO) to weight various website features effectively to achieve higher accuracy when detecting phishing websites. In particular, the proposed PSO-based website feature weighting is used to differentiate between the various features in websites, based on how important they contribute towards recognizing the phishing from legitimate websites. The experimental results indicated that the proposed PSO-based feature weighting achieved outstanding improvements in terms of classification accuracy, true positive and negative rates, and false positive and negative rates of the machine learning models using only fewer websites features utilized in the detection of phishing websites.
Article
As the Internet of Things (IoT) is becoming increasingly popular, we have experienced more security breaches that are associated with the connection of vulnerable IoT devices. Therefore, it is crucial to employ intrusion detection techniques to mitigate attacks that exploit IoT security vulnerabilities. However, due to the limited capabilities of IoT devices and the specific protocols used, conventional intrusion detection mechanisms may not work well for IoT environments. In this paper, we propose a novel intrusion detection model that uses machine learning to effectively detect cyber-attacks and anomalies in resource-constraint IoT networks. Through a set of optimizations including removal of multicollinearity, sampling, and dimensionality reduction, our model can identify the most important features to detect intrusions using much fewer training data and less training time. Extensive experiments were performed on the CICIDS2017 and NSL-KDD datasets respectively to evaluate the proposed approach. The experimental results on two popular datasets show that our model has a high detection rate and a low false alarm rate. It outperforms existing models in multiple performance metrics and is consistent in classifying major cyber-attacks, respectively. Most importantly, unlike traditional resource-intensive intrusion detection systems, the proposed model is lightweight and can be deployed on IoT nodes with limited power and storage capabilities.
Article
Machine learning models are widely used for anomaly detection in network traffic. Effective transformation of the raw traffic data into mathematical expressions and hyper-parameter adjustment are two important steps before training the machine learning classifier, which is used to predict whether the unknown traffic is normal or abnormal. In this paper, a novel model SVM-L is proposed for anomaly detection in network traffic. In particular, raw URLs are treated as natural language, and then transformed into mathematical vectors via statistical laws and natural language processing technique. They are used as the training data for the traffic classifier, the kernel Support Vector Machine (SVM). Based on the idea of the dual formulation of kernel SVM and Linear Discriminant Analysis (LDA), we propose an optimization model to adjust the hyper-parameter of the classifier. The corresponding problem is simply one-dimensional, and is easily solved by the golden section method. Numerical tests indicate that the proposed model achieves more than 99% accuracy on all tested datasets, and outperforms the state of the arts in terms of standard evaluation measurements.