Malicious Packet Classification Based on Neural Network Using Kitsune Features

Kohei Miyamoto¹, Hiroki Goto¹, Ryosuke Ishibashi¹, Han Chansu², Tao Ban²,
Takeshi Takahashi², and Jun’ichi Takeuchi¹
¹ Kyushu University, Fukuoka, Japan
² National Institute of Information and Communications Technology, Tokyo, Japan
Abstract. Network Intrusion Detection Systems (NIDSes) play an important role
in security operations by detecting and defending against cyberattacks. Because
artificial intelligence (AI)-powered NIDSes adapt to various kinds of attacks by
exploiting the knowledge present in the data, they are in high demand for
handling today's increasingly diverse and intense cyberattacks. In this paper,
we present a feasibility study on neural network (NN)-based NIDSes that aims to
solve the packet classification problem: distinguishing malicious packets from
benign packets while specifying the class of anomaly to which a malicious packet
belongs. We employ the features defined by Kitsune, a lightweight NN-based
packet anomaly detector, as inputs to our classifier. A Kitsune feature vector
is composed of statistics calculated from a single packet and its predecessors
using a successive algorithm. We evaluate the proposed packet classification
scheme using the CSE-CIC-IDS2018 open dataset. The experimental results show
that our method achieves good performance for particular attack types,
suggesting that it can meet the requirements of a practical NIDS.
Keywords: Network intrusion detection system · packet classification ·
neural networks
1 Introduction
The number and variety of devices connected to the Internet have been growing
exponentially in recent years, and so have the cyberattacks targeting these
devices. In defending against these cyberattacks, network intrusion detection
systems (NIDSes) help security operators by monitoring network traffic,
detecting suspicious behaviors therein, and issuing alerts based on the
detection results. Many kinds of NIDSes have been proposed so far. Because they
differ in detection mechanism and implementation, each has its own pros and
cons. Generally, a proper aggregation of the outputs from multiple NIDSes is
expected to provide better security protection than a single appliance.
In this paper, we discuss effective ways to develop an AI-powered packet
classifier that predicts the class name of the cyberattack for each attack
packet. AI-powered NIDSes can be roughly divided into two categories: anomaly
detectors and multi-class classifiers. An anomaly detector outputs values called
anomaly scores that measure whether captured packets are benign or not. In
contrast, a classifier outputs the class labels of the anomalies to which the
packets belong. Both types use feature vectors extracted from monitored traffic
as their input. However, anomaly detectors can be trained in an unsupervised
way, that is, the training data need not be labeled, whereas classifiers have to
be trained using labeled data.
We use Kitsune [5], a well-known AI-based packet anomaly detector, as the
basis for developing our AI-powered NIDS. The first step of our AI-powered NIDS
is to extract the input features from monitored traffic. For that process, we
use the feature extractor of Kitsune, which employs a successive algorithm to
extract statistical features that characterize the packet and the communication
sessions to which the packet belongs. Example features include the length and
protocol of the packet and the frequency of packet communication between two
hosts. Using these features, Kitsune performs packet-level anomaly detection
based on the reconstruction error of auto-encoders.
The system framework proposed in Kitsune has proved to be effective and
efficient as a packet anomaly detector. In this paper, we seek to further extend
its application to solve the multi-class packet classification problem. To do so,
we design a new packet classifier based on NNs that can explore the knowledge
in Kitsune features to predict the attack types associated with the packet. We
evaluate the proposed scheme using the CSE-CIC-IDS2018 open dataset [2, 7].
The results of our experiment show that our classifier has good performance for
many classes in the dataset.
In the rest of this paper, we explain the feature extraction in Kitsune and our
scheme to classify packets using the Kitsune features. Finally, we describe the
experiments used to evaluate our scheme, show their results, and discuss them.
2 Related Work
Ishibashi et al. [3] proposed a method to generate labeled datasets using alerts
from existing NIDSes; such datasets can be used to train classifiers like ours.
Hwang et al. [1] proposed another packet classification method using features
based on word embedding techniques. Takahashi et al. [8] proposed the
integration of various methods for analysing cyberattacks. Trainable NIDSes
like ours will be useful as components of such systems.
3 Preliminaries
In this section, we first introduce the problem setting of packet classification.
Then, we provide a brief introduction of Kitsune and its feature extraction.
3.1 Problem Setting
When an NIDS is working, it monitors traffic in a specified network. The traffic
can be represented as a sequence of packets. Let $P$ be the set of all possible
packets and $p_1, p_2, \ldots$ be a sequence of packets captured by the NIDS. We
assume each packet is timestamped when it is captured. Given a finite set of
classes $C$, a packet classifier can be regarded as a mapping from $P$ to $C$.
A NN for packet classification can likewise be regarded as a mapping from some
feature space to the set of probability vectors over $C$. In this paper, we
consider only simple feed-forward NNs. Therefore, the feature space is the real
vector space $\mathbb{R}^D$ of fixed dimension $D$. Hence, a feature extraction
method can be regarded as a mapping from $P$ to $\mathbb{R}^D$. When a packet is
captured, the NIDS extracts a feature vector from it. Then, the NIDS inputs the
feature vector to the NN and obtains a probability vector over $C$ as its
output. As a classifier, the NIDS outputs the class that has the maximum
probability in the distribution.
Labeled data for training such classifiers are pairs of a packet and a class,
i.e., members of $P \times C$. Each class in $C$ is one-hot encoded before it is
given to a NN. A NN for classification is usually trained by minimizing the
categorical cross-entropy between output vectors and true labels. The
categorical cross-entropy is a loss function commonly used in multi-class
classification. When a given label is the $i$-th class in the set of classes
$C$, this label is encoded into a one-hot vector $y = (y_1, \ldots, y_{|C|})$
where $y_i = 1$ and the other elements are 0. For an output probability vector
$\hat{y}$ of the NN and an encoded label $y$, the categorical cross-entropy loss
is defined as $L(y, \hat{y}) = -\sum_{j=1}^{|C|} y_j \log \hat{y}_j$.
Optimization with the categorical cross-entropy loss leads the NN to
approximate the conditional probability of the classes given the input
features. The data is usually split into two subsets, a train set and a test
set. During the training phase, we optimize the weights of the NN using the
train set. During the testing phase, we evaluate the performance of the trained
NN using the test set. Several measures of classifier performance exist; for
example, the precision, recall and F-measure of the predictions are commonly
used.
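As a minimal sketch of the definitions above, the following pure-Python snippet
one-hot encodes a label, evaluates the categorical cross-entropy for a single
output vector, and takes the argmax as the prediction. The function names and
the 3-class example values are illustrative assumptions, not part of our
implementation; the small constant `eps` is introduced here only to avoid taking
the logarithm of zero.

```python
import math

def one_hot(index, num_classes):
    """Encode class `index` (0-based) as a one-hot list of length `num_classes`."""
    return [1.0 if j == index else 0.0 for j in range(num_classes)]

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_j y_j * log(y_hat_j); eps guards against log(0)."""
    return -sum(yj * math.log(max(pj, eps)) for yj, pj in zip(y, y_hat))

def predict_class(y_hat):
    """Return the index of the class with the maximum probability (argmax)."""
    return max(range(len(y_hat)), key=lambda j: y_hat[j])

# Hypothetical 3-class example: the true label is class 1.
y = one_hot(1, 3)
y_hat = [0.2, 0.7, 0.1]
loss = categorical_cross_entropy(y, y_hat)   # equals -log(0.7)
pred = predict_class(y_hat)                  # index 1
```

Because $y$ is one-hot, only the term of the true class contributes to the sum,
so the loss reduces to the negative log-probability assigned to that class.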
3.2 Feature Extraction of Kitsune
Kitsune [5] is an NIDS built on a NN-based anomaly detector. A reference
implementation of Kitsune is provided at [9]. The anomaly detector of Kitsune
has a unique structure that consists of an ensemble of auto-encoders and a
preprocessing method called the feature mapper. However, in this paper, we use
only the feature extractor of Kitsune. Therefore, we do not explain the
anomaly detector of Kitsune in detail.
The feature extractor of Kitsune is designed to process arriving packets
successively without large memory consumption. Each captured packet provides a
timestamp, a packet size, MAC addresses, IP addresses and TCP/UDP ports. The
feature extractor uses this information from a given packet to calculate a
feature vector and to update the states used for the calculation.
The feature extractor manages statistics called damped incremental statistics.
For a parameter $\lambda > 0$, an incremental statistic is a 3-tuple of real
values denoted as $IS_\lambda = (w, LS, SS)$. Each incremental statistic
$IS_\lambda$ is related to a data stream determined by MAC addresses, IP
addresses and TCP/UDP ports. Each packet is also related to some data streams.
Data streams are divided into the following 4 types.
srcIP : the source IP address of a packet
srcMAC-IP : (srcMAC, srcIP), the pair of source MAC address and source IP
address of a packet
Channel : (srcIP, dstIP), the pair of source and destination IP addresses of
a packet
Socket : (srcIP, srcPort, dstIP, dstPort), the 4-tuple of IP addresses and
TCP/UDP ports used by a packet
Therefore, for each packet, the feature extractor updates 4 incremental
statistics. Each incremental statistic is initialized with zero values. For the
3 types of incremental statistics other than the Channel type, let $x$ be the
packet size of a given packet; for the Channel type, let $x$ be the jitter
value, defined as the difference between the packet's timestamp and the
timestamp of the last packet observed between the same IP addresses. For a
packet with timestamp $t$, each incremental statistic is updated as follows:
$\gamma = 2^{-\lambda (t - t_{\mathrm{last}})}$, (1)
$(w, LS, SS) \leftarrow (\gamma w + 1, \gamma LS + x, \gamma SS + x^2)$, (2)
where $t_{\mathrm{last}}$ is the timestamp of the last packet related to the
same stream and $\leftarrow$ denotes an update of the variables on the left
side. These updates can be done successively without keeping any information
about previously processed packets except the timestamp $t_{\mathrm{last}}$.
The parameter $\lambda$ determines the intensity of the time decay applied by
multiplying by $\gamma$. The feature extractor uses multiple values of
$\lambda$. It extracts features based on each of them, concatenates these
features, and outputs the concatenated feature vector.
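The update rule of Eqs. (1) and (2) can be sketched in a few lines of Python.
This is an illustrative reading of the equations, not the reference
implementation in [9]; the class name `DampedIncStat`, the handling of the first
packet on a stream, and the assumption that timestamps are given in seconds are
choices made for this sketch.

```python
class DampedIncStat:
    """Damped incremental statistic IS_lambda = (w, LS, SS) for one data stream."""

    def __init__(self, lam):
        self.lam = lam          # decay parameter lambda > 0
        self.w = 0.0            # damped count of observations
        self.ls = 0.0           # damped linear sum of x
        self.ss = 0.0           # damped squared sum of x
        self.t_last = None      # timestamp of the last packet on this stream

    def update(self, x, t):
        """Apply Eq. (1) and Eq. (2) for an observation x at timestamp t."""
        if self.t_last is None:
            gamma = 1.0  # first packet: the state is all zeros, so gamma has no effect
        else:
            gamma = 2.0 ** (-self.lam * (t - self.t_last))   # Eq. (1)
        self.w = gamma * self.w + 1.0                        # Eq. (2)
        self.ls = gamma * self.ls + x
        self.ss = gamma * self.ss + x * x
        self.t_last = t

stat = DampedIncStat(lam=1.0)
stat.update(x=100.0, t=0.0)
stat.update(x=200.0, t=1.0)   # one time unit later: gamma = 2 ** -1 = 0.5
state = (stat.w, stat.ls, stat.ss)
```

Only the three accumulated values and the last timestamp are stored per stream,
which is what allows the extractor to run successively with small memory use.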
From an incremental statistic, we obtain the statistics $\mu = LS/w$ and
$\sigma = \sqrt{|SS/w - (LS/w)^2|}$. They approximate the mean and the standard
deviation, respectively, of $x$ observed over some recent period. Since each of
them depends on a single data stream, these features are called 1D statistics
in [5]. For Channel- and Socket-type streams, 4 other kinds of statistics,
called 2D statistics, are defined. They depend on 2 data streams, for example
2 streams related to different source IP addresses. They reflect
characteristics such as the covariance and correlation between the 2 streams.
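As a small worked example with hypothetical values (not taken from the dataset),
suppose two packets of sizes 100 and 200 arrive on the same stream with the same
timestamp, so that the decay factor is 1 in the second update. The incremental
statistic and the derived mean and standard deviation are then:

```latex
\begin{align*}
(w, LS, SS) &= (2,\ 100 + 200,\ 100^2 + 200^2) = (2,\ 300,\ 50000),\\
\mu &= LS / w = 300 / 2 = 150,\\
\sigma &= \sqrt{\left| SS/w - (LS/w)^2 \right|} = \sqrt{|25000 - 22500|} = 50.
\end{align*}
```

Without decay these coincide with the ordinary mean and (population) standard
deviation of the two sizes; with decay, older packets contribute less.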
The feature extractor extracts 20 statistics from a packet for each $\lambda$.
The extracted statistics consist of $4 \times 3 = 12$ 1D statistics and
$2 \times 4 = 8$ 2D statistics. In [5], $\lambda = 5, 3, 1, 0.1, 0.01$ are
employed. The extracted feature vectors used for anomaly detection are
$5 \times 20 = 100$ dimensional vectors consisting of 60 1D statistics and 40
2D statistics.
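The dimension counts above can be checked with a short sketch that also shows
how the key of each stream type might be derived from a packet's header fields.
The helper `stream_keys`, the feature naming scheme, and the choice of
$(w, \mu, \sigma)$ as the three 1D statistics per stream are assumptions made
for illustration; the actual layout is defined by the reference implementation
in [9].

```python
LAMBDAS = [5, 3, 1, 0.1, 0.01]                      # decay parameters used in [5]
STREAMS = ["srcIP", "srcMAC-IP", "Channel", "Socket"]
STATS_1D = ["w", "mu", "sigma"]                     # assumed per-stream 1D statistics
STREAMS_2D = ["Channel", "Socket"]                  # stream types with 2D statistics
N_2D_PER_STREAM = 4                                 # 2D statistics per such stream

def stream_keys(src_mac, src_ip, src_port, dst_ip, dst_port):
    """Return the key identifying each of the 4 data streams for one packet."""
    return {
        "srcIP": src_ip,
        "srcMAC-IP": (src_mac, src_ip),
        "Channel": (src_ip, dst_ip),
        "Socket": (src_ip, src_port, dst_ip, dst_port),
    }

# One 1D feature per (lambda, stream, statistic) combination.
names_1d = [f"{s}_{stat}_lam{lam}"
            for lam in LAMBDAS for s in STREAMS for stat in STATS_1D]
n_1d = len(names_1d)                                           # 5 * 4 * 3
n_2d = len(LAMBDAS) * len(STREAMS_2D) * N_2D_PER_STREAM        # 5 * 2 * 4
```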
4 Methodology
In this section, we propose a new packet classification method based on NN and
Kitsune features.
Suppose we have a dataset consisting of packets and labels that indicate
which class each packet belongs to. Using the relationship between packets and
labels in the dataset, we can construct a packet classifier by supervised
learning. A NN is a powerful model capable of learning such relationships
between inputs and outputs. NN-based packet classifiers can adapt to various
characteristics of attacks and traffic by learning from appropriate data.
We propose using Kitsune features as inputs to a NN-based packet classifier.
Kitsune uses the features produced by its feature extractor for anomaly
detection. However, they can also serve as input for NN-based packet
classification. Although Kitsune features are originally 100-dimensional
vectors, we use only the 60-dimensional subset consisting of 1D statistics,
because extracting the 2D statistics takes too much computation time on large
datasets.
In this paper, we use a simple feed-forward NN as the classifier. Our
classifier has a 60-dimensional input layer and a softmax layer as its output
layer. The output for each input is regarded as a probability vector over the
set of classes to predict. Our classifier takes the argmax of the output
probability vector and outputs the corresponding class as the prediction.
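A forward pass of such a classifier can be sketched without any deep learning
library. The snippet below is only an illustration of the data flow from a
60-dimensional input to a softmax output and an argmax prediction; the hidden
widths follow the setting reported in Section 5.3, while the weight
initialisation, the zero input vector, and all function names are assumptions of
this sketch, and the weights are untrained.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero biases; the scheme is illustrative."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    """Affine map W x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, hidden_layers, output_layer):
    """tanh hidden layers followed by a softmax output layer."""
    h = x
    for layer in hidden_layers:
        h = [math.tanh(v) for v in dense(h, layer)]
    return softmax(dense(h, output_layer))

rng = random.Random(0)
sizes = [60, 16, 16, 16]                        # input dimension and hidden widths
hidden = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = init_layer(sizes[-1], 15, rng)         # 15 classes: 14 attacks + benign

x = [0.0] * 60                                  # placeholder for a standardized feature vector
probs = forward(x, hidden, output)
predicted_class = max(range(len(probs)), key=lambda j: probs[j])
```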
Using Kitsune features as inputs of a NN has the following benefit. After the
training, we can use the NN for online processing, because the feature extraction
is done in an online manner.
5 Experiment
In this section, we show the experimental results using the open dataset
CSE-CIC-IDS2018 [2]. First, we introduce the dataset. Then, we describe the
setting of the experiments and show their results.
5.1 CSE-CIC-IDS2018 Dataset
The CSE-CIC-IDS2018 dataset [2, 6] is an open dataset of cyberattack traffic.
This dataset is provided by the Communications Security Establishment (CSE)
and the Canadian Institute for Cybersecurity (CIC) and distributed at [7].
The CSE-CIC-IDS2018 dataset was generated by simulating a network with benign
traffic and running tools to attack the simulated network. This dataset
contains the raw captured traffic for each day and information on the attacks.
The attack information includes the kind of each attack, the period of each
attack, and the IP addresses of the attackers and targets of each attack.
Therefore, we can label packets in the captured traffic by using the attack
information.
This dataset consists of 10 days of data. The traffic was captured per day and
per machine in the network. Each day includes traffic from 1 to 3 kinds of
attacks, and all days include benign traffic. The periods of different kinds
of attacks do not overlap. This dataset contains 14 kinds of attacks in total.
Therefore, including the benign class, we use 15 classes in our experiments.
5.2 Labeling and Feature Extraction
We label the data by the following procedure. For each packet, we look at its
capture timestamp and its IP addresses. If the timestamp falls within the
period of a kind of attack and the packet was transmitted from the attackers
to the targets, we label the packet with the attack's name. If no attack
period contains the timestamp of the packet, or the packet is not from
attackers to targets, we label the packet “BENIGN”.
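The labeling rule can be sketched as a small function. The record type
`AttackPeriod`, its field names, the use of inclusive period bounds, and the
example addresses and times are hypothetical and serve only to illustrate the
procedure; the actual attack information comes from the dataset documentation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttackPeriod:
    name: str
    start: float            # start of the attack period (epoch seconds, assumed)
    end: float              # end of the attack period
    attackers: frozenset    # attacker IP addresses
    targets: frozenset      # target IP addresses

def label_packet(timestamp, src_ip, dst_ip, periods):
    """Return the attack name if the packet falls in an attack period and goes
    from an attacker to a target; otherwise return "BENIGN"."""
    for p in periods:
        in_period = p.start <= timestamp <= p.end
        attacker_to_target = src_ip in p.attackers and dst_ip in p.targets
        if in_period and attacker_to_target:
            return p.name
    return "BENIGN"

# Hypothetical attack record; the addresses and times are made up.
periods = [AttackPeriod("DoS-Example", 100.0, 200.0,
                        frozenset({"10.0.0.9"}), frozenset({"10.0.0.1"}))]
```

Note that, under this rule, a reply sent from a target back to an attacker
during the attack period is labeled “BENIGN”, because only the
attacker-to-target direction is treated as attack traffic.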
Since the number of benign packets is usually much larger than the number of
anomalous packets, for each day we use only the packets captured at the target
IP addresses of the attacks, to ease the label imbalance.
We use the reference implementation of the extractor in [9]. The extraction
was done in the temporal order of capture.
5.3 Experiments and Results
We conducted two types of experiments. The first type uses the data of each day
separately. Since each day's data contain at most four classes, we performed
classification with at most four classes in this type of experiment. We held
out the last 20% of the packets in each attack duration as the test data. We
used the rest of the packets in the attack duration, together with the packets
captured from 30 minutes before the attack duration, as the training data. As
an exception, for the data from 2018/02/21, we used only a 25% sub-sample of
such training data, because the total number of packets on this date is too
large for our experiments. The sub-sampling was stratified.
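The chronological split described above can be sketched as follows. The
function name, the tuple layout `(timestamp, features, label)`, and the reading
of "the last 20%" as a fraction of the packet count (rather than of the elapsed
time) are assumptions of this sketch.

```python
def split_attack_window(packets, attack_start, attack_end,
                        test_fraction=0.2, lead_seconds=30 * 60):
    """Split timestamp-sorted (timestamp, features, label) tuples for one attack.

    Test set: the last `test_fraction` of the packets inside the attack duration.
    Train set: the remaining packets inside the duration plus the packets
    captured in the `lead_seconds` before the attack starts.
    """
    before = [p for p in packets if attack_start - lead_seconds <= p[0] < attack_start]
    inside = [p for p in packets if attack_start <= p[0] <= attack_end]
    n_test = int(len(inside) * test_fraction)
    cut = len(inside) - n_test
    return before + inside[:cut], inside[cut:]

# Hypothetical packets: one per minute from t = 0 s, attack from 3600 s to 4200 s.
packets = [(60.0 * i, None, "BENIGN") for i in range(80)]
train, test = split_attack_window(packets, attack_start=3600.0, attack_end=4200.0)
```

Holding out the latest packets, instead of a random subset, keeps the test data
strictly later in time than the training data of the same attack.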
The second type is an experiment using the data from all the days. We drew a
stratified sample of 20000 packets from each day's training data used in the
previous experiments and merged the samples into a sub-dataset, which we call
the mixture data. We performed 15-class classification with this mixture data.
In the test phase, we used the same test data as in the first type of
experiment.
In all experiments, we used NNs consisting of an input layer, 3 hidden layers
and a softmax layer. Each hidden layer has 16 units. We used hyperbolic tangent
activation functions in the hidden layers. We implemented the NNs using
TensorFlow and used Adam [4] as the optimizer. The initial learning rate was
0.001. The training batch size was 1024. We stopped training early when the
validation loss had not improved on its minimum for 5 epochs.
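The early-stopping criterion can be illustrated independently of any framework.
The function below is a sketch of the rule with a patience of 5 epochs, applied
to a hypothetical validation-loss curve; it is not the training code used in
the experiments, and details such as whether the best weights are restored are
not covered.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never stops.

    Training stops once the validation loss has failed to go below its running
    minimum for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation-loss curve: the minimum is reached at epoch 3.
losses = [0.9, 0.6, 0.5, 0.55, 0.52, 0.51, 0.50, 0.53, 0.40]
stop = early_stopping_epoch(losses)
```

In this example training stops at epoch 8, so the later drop to 0.40 is never
observed; a larger patience trades longer training for a lower risk of stopping
on a temporary plateau.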
In each experiment, we repeated the following procedure 5 times and calculated
the mean values of the evaluation metrics obtained. First, we initialized a
model. Then, we randomly split the training data into train/validation sets
with a ratio of 3:1. Finally, we trained the model and evaluated it with the
test data. All features were standardized based on the train set in both
training and evaluation.
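The split and standardization steps of this procedure can be sketched in plain
Python. The function names, the fixed random seed, the use of the population
standard deviation, and the handling of constant features are assumptions of
this sketch rather than details of the actual implementation.

```python
import math
import random

def split_train_validation(rows, ratio=(3, 1), seed=0):
    """Randomly split rows into train/validation lists with the given ratio."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * ratio[0] // (ratio[0] + ratio[1])
    return shuffled[:n_train], shuffled[n_train:]

def fit_standardizer(train_rows):
    """Compute per-feature mean and standard deviation on the train set only."""
    n, dim = len(train_rows), len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(dim)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in train_rows) / n)
            for j in range(dim)]
    return means, stds

def standardize(rows, means, stds):
    """Apply train-set statistics; constant features are only centred."""
    return [[(v - m) / s if s > 0 else v - m for v, m, s in zip(r, means, stds)]
            for r in rows]

# Hypothetical 2-dimensional feature vectors.
data = [[float(i), 10.0] for i in range(8)]
train, val = split_train_validation(data)
means, stds = fit_standardizer(train)
train_std = standardize(train, means, stds)
val_std = standardize(val, means, stds)     # validation/test reuse the train statistics
```

Fitting the mean and standard deviation on the train set alone avoids leaking
information from the validation and test data into the preprocessing.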
Fig. 1: Experimental results per day
Fig. 2: Experimental results for the mixture data
The metrics we employed to evaluate a classifier were precision, recall and
F-measure. Let $T$ be a test set of pairs of a feature vector and a label. Let
$p(x)$ be the class predicted by the classifier for the input feature vector
$x$. The metrics are defined for each class $c$ as
$\mathrm{precision}(c) = TP(c)/(TP(c) + FP(c))$,
$\mathrm{recall}(c) = TP(c)/(TP(c) + FN(c))$ and
$F(c) = 2\,\mathrm{precision}(c) \cdot \mathrm{recall}(c) / (\mathrm{precision}(c) + \mathrm{recall}(c))$,
where
$TP(c) = \sum_{(x,y) \in T} I(p(x) = c, y = c)$,
$TN(c) = \sum_{(x,y) \in T} I(p(x) \neq c, y \neq c)$,
$FP(c) = \sum_{(x,y) \in T} I(p(x) = c, y \neq c)$,
$FN(c) = \sum_{(x,y) \in T} I(p(x) \neq c, y = c)$
and $I(\cdot)$ is the indicator function. All of these metrics take values in
$[0, 1]$ and larger values mean better performance.
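The per-class metrics can be computed directly from lists of true and predicted
labels, as in the following sketch. The function name, the example labels, and
the convention of returning 0.0 when a denominator is zero are assumptions made
here for illustration.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Return {class: (precision, recall, F-measure)} from true and predicted labels."""
    results = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
        # Treating 0/0 as 0.0 is a convention chosen here, not taken from the paper.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[c] = (precision, recall, f)
    return results

# Hypothetical labels for six test packets.
y_true = ["BENIGN", "BENIGN", "DoS", "DoS", "DoS",    "Bot"]
y_pred = ["BENIGN", "DoS",    "DoS", "DoS", "BENIGN", "Bot"]
metrics = per_class_metrics(y_true, y_pred, ["BENIGN", "DoS", "Bot"])
```

In this example the class "DoS" has 2 true positives, 1 false positive and 1
false negative, so its precision, recall and F-measure are all 2/3.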
We show the results for the per-day experiments in Fig. 1. The horizontal axis
shows the names of the classes and the dates. The vertical axis shows the mean
value of each metric. We also show the results for the experiment using the
mixture data in Fig. 2.
6 Discussion
Fig. 1 shows that our per-day classifiers perform well except for the classes
“Infiltration” and “SQL-Injection”. However, Fig. 2 shows that our classifier
for the mixture data performs worse on some classes than the per-day
classifiers. In particular, the performance for the DDoS-LOIC-HTTP class and
the DoS-SlowHTTPTest class decreases significantly. This implies that our
classifier may fail to classify these classes in practical situations. Since
Kitsune features were originally designed for anomaly detection, they may not
contain sufficient information to discriminate some attacks.
7 Conclusion
We proposed a new NN-based packet classifier that uses Kitsune features as
inputs. We evaluated the proposed classifier through experiments on the
CSE-CIC-IDS2018 open dataset. Our experiments show that Kitsune 1D features
can be used for packet classification and give reasonable performance for many
kinds of attacks. However, they also show that the performance degrades when
a large number of attack classes must be discriminated.
Acknowledgments
This research was conducted under a contract of “MITIGATE” among “Research
and Development for Expansion of Radio Wave Resources (JPJ000254),” which
was supported by the Ministry of Internal Affairs and Communications, Japan.
References
1. Hwang, R.H., Peng, M.C., Nguyen, V.L., Chang, Y.L.: An LSTM-based deep learning
approach for classifying malicious traffic at the packet level. Applied Sciences 9(16)
(2019)
2. Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: Toward generating a new intrusion
detection dataset and intrusion traffic characterization. In: 4th International
Conference on Information Systems Security and Privacy (ICISSP) (Jan 2018)
3. Ishibashi, R., Goto, H., Han, C., Ban, T., Takahashi, T., Takeuchi, J.: Which packet
did they catch? Associating NIDS alerts with their communication sessions. In: The
16th Asia Joint Conference on Information Security (Aug 2021)
4. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (Dec 2014)
5. Mirsky, Y., Doitshman, T., Elovici, Y., Shabtai, A.: Kitsune: An ensemble of
autoencoders for online network intrusion detection. In: Network and Distributed
System Security Symposium 2018 (Feb 2018)
6. Online: CSE-CIC-IDS2018 on AWS. https://www.unb.ca/cic/datasets/ids-2018.html
(last visited on 2021-12-31)
7. Online: A realistic cyber defense dataset (CSE-CIC-IDS2018).
https://registry.opendata.aws/cse-cic-ids2018/ (last visited on 2021-12-31)
8. Takahashi, T., Umemura, Y., Han, C., Ban, T., Furumoto, K., Nakamura, O.,
Yoshioka, K., Takeuchi, J., Murata, N., Shiraishi, Y.: Designing comprehensive cyber
threat analysis platform: Can we orchestrate analysis engines? In: 2021 IEEE
International Conference on Pervasive Computing and Communications Workshops and
other Affiliated Events (PerCom Workshops) (2021)
9. ymirsky: Kitsune-py. https://github.com/ymirsky/Kitsune-py (last visited on
2021-12-31)