All content in this area was uploaded by Chansu Han on Jun 20, 2022
Malicious Packet Classification Based on Neural
Network Using Kitsune Features
Kohei Miyamoto1, Hiroki Goto1, Ryosuke Ishibashi1, Han Chansu2, Tao Ban2,
Takeshi Takahashi2, and Jun’ichi Takeuchi1
1Kyushu University, Fukuoka, Japan
2National Institute of Information and Communications Technology, Tokyo, Japan
Abstract. Network intrusion detection systems (NIDSes) play an important
role in security operations by detecting and defending against cyberattacks.
Because artificial intelligence (AI)-powered NIDSes adapt to various kinds of
attacks by exploiting the knowledge present in the data, they are in high
demand for handling today's cyberattacks, which are increasing in diversity
and intensity. In this paper, we present a feasibility study on neural network
(NN)-based NIDSes that aims to solve the packet classification problem:
distinguishing malicious packets from benign packets while specifying the
class of anomaly to which each malicious packet belongs. We employ the
features defined by Kitsune, a lightweight NN-based packet anomaly detector,
as inputs to our classifier. A Kitsune feature vector is composed of
statistics calculated from a single packet and its predecessors using a
successive algorithm. We evaluate the proposed packet classification scheme
using the CSE-CIC-IDS2018 open dataset. The experimental results show that our
method achieves good performance for particular attack types, for which it can
meet the requirements of practical NIDSes.
Keywords: Network intrusion detection system · packet classification ·
neural networks
1 Introduction
The number and variety of devices connected to the Internet have been growing
exponentially in recent years, and so have the cyberattacks targeting these
devices. In defending against these cyberattacks, network intrusion detection
systems (NIDSes) help security operators by monitoring network traffic,
detecting suspicious behaviors therein, and issuing alerts based on the
detection results. Many kinds of NIDSes have been proposed so far. Each has
its own pros and cons, depending on its detection mechanism and
implementation. Generally, a proper aggregation of the outputs from multiple
NIDSes is expected to realize better security protection than a single
appliance.
In this paper, we discuss effective ways to develop an AI-powered packet
classifier that predicts the class of cyberattack to which each attack packet
belongs. AI-powered NIDSes can be roughly divided into two categories: anomaly
detectors
2 Authors Suppressed Due to Excessive Length
and multi-class classifiers. An anomaly detector outputs values called anomaly
scores that measure whether captured packets are benign or not. In contrast, a
classifier outputs the class labels of the anomalies to which the packets
belong. Both types use feature vectors extracted from the monitored traffic as
their input. However, anomaly detectors can be trained in an unsupervised way,
that is, the training data need not be labeled, whereas classifiers have to be
trained using labeled data.
We use Kitsune [5], a well-known AI-based packet anomaly detector, as the
basis for developing our AI-powered NIDS. The first step in our AI-powered
NIDS is to extract the input features from the monitored traffic. For that
process, we utilize the feature extractor of Kitsune, which employs a
successive algorithm to extract statistical features that characterize the
packet and the communication sessions to which the packet belongs. Example
features include the length and protocol of the packet and the frequency of
packet communication between two hosts. Based on these features, Kitsune
performs packet-level anomaly detection using the reconstruction error of
auto-encoders.
The system framework proposed in Kitsune has proved to be effective and
efficient as a packet anomaly detector. In this paper, we seek to further extend
its application to solve the multi-class packet classification problem. To do so,
we design a new packet classifier based on NNs that can explore the knowledge
in Kitsune features to predict the attack types associated with the packet. We
evaluate the proposed scheme using the CSE-CIC-IDS2018 open dataset [2, 7].
The results of our experiment show that our classifier has good performance for
many classes in the dataset.
In the rest of this paper, we explain the feature extraction in Kitsune and
our scheme for classifying packets using the Kitsune features. Finally, we
describe the experiments used to evaluate our scheme, show their results, and
discuss them.
2 Related Work
Ishibashi et al. [3] proposed a method to generate labeled datasets using
alerts from existing NIDSes; such datasets can be used to train classifiers
like ours. Hwang et al. [1] proposed another packet classification method
using features based on word embedding techniques. Takahashi et al. [8]
proposed the integration of various methods for analysing cyberattacks.
Trainable NIDSes such as ours will be useful as components of such integrated
platforms.
3 Preliminaries
In this section, we first introduce the problem setting of packet classification.
Then, we provide a brief introduction of Kitsune and its feature extraction.
3.1 Problem Setting
When an NIDS is working, it monitors traffic in a specified network. The traffic
can be represented as a sequence of packets. Let $P$ be the set of all possible
Title Suppressed Due to Excessive Length 3
packets and $p_1, p_2, \ldots$ be a sequence of packets captured by the NIDS.
We assume that each packet is timestamped when it is captured. Given a finite
set of classes $C$, a packet classifier can be regarded as a mapping from $P$
to $C$. An NN for packet classification can likewise be regarded as a mapping
from some feature space to the set of probability vectors over $C$. In this
paper, we consider only simple feed-forward NNs. Therefore, the feature space
is the real vector space $\mathbb{R}^D$ of fixed dimension $D$. Hence, a
feature extraction method can be regarded as a mapping from $P$ to
$\mathbb{R}^D$. When a packet is captured, the NIDS extracts a feature vector
from it. Then, the NIDS inputs the feature vector to the NN and obtains a
probability vector over $C$ as its output. As a classifier, the NIDS outputs
the class that has the maximum probability in this distribution.
Labeled data for training such classifiers are pairs of a packet and a class,
i.e., members of $P \times C$. Each class in $C$ is one-hot encoded before it
is given to an NN. An NN for classification is usually trained by minimizing
the categorical cross-entropy between output vectors and true labels. The
categorical cross-entropy is a loss function commonly used in multi-class
classification. When a given label is the $i$-th class in the set of classes
$C$, this label is encoded into a one-hot vector $y = (y_1, \ldots, y_{|C|})$
where $y_i = 1$ and all other elements are 0. For an output probability vector
$\hat{y}$ of the NN and an encoded label $y$, the categorical cross-entropy
loss is defined as
$$L(y, \hat{y}) = -\sum_{j=1}^{|C|} y_j \log \hat{y}_j.$$
Optimization with the categorical cross-entropy loss drives the NN output
toward an approximation of the conditional probability of the classes given
the input features. The data are usually split into two subsets, a train set
and a test set. During the training phase, we optimize the weights of the NN
using the train set. During the testing phase, we evaluate the performance of
the trained NN using the test set. Several measures of classifier performance
are commonly used, e.g., the precision, recall and F-measure of the prediction.
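As a small illustration of the loss and the argmax decision rule described in this subsection, the following Python sketch computes the categorical cross-entropy for one sample. The function names, the clipping constant `eps` and the toy probabilities are our own assumptions for illustration and do not come from the original implementation.

```python
import math

def one_hot(index, num_classes):
    """Encode a class index as a one-hot list of length num_classes."""
    return [1.0 if j == index else 0.0 for j in range(num_classes)]

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_j y_j * log(y_hat_j).

    eps is an assumed safeguard against log(0); it is not part of the definition.
    """
    return -sum(y_j * math.log(max(p_j, eps)) for y_j, p_j in zip(y, y_hat))

def predict_class(y_hat):
    """Return the index of the class with the maximum probability."""
    return max(range(len(y_hat)), key=lambda j: y_hat[j])

# Toy example with |C| = 3; the probabilities are made up for illustration.
y = one_hot(1, 3)
y_hat = [0.2, 0.7, 0.1]
loss = categorical_cross_entropy(y, y_hat)  # equals -log(0.7)
predicted = predict_class(y_hat)            # index 1
```

Because the label is one-hot, only the term of the true class contributes to the sum, so the loss reduces to the negative log-probability assigned to the true class.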
3.2 Feature Extraction of Kitsune
Kitsune [5] is an NIDS built on an NN-based anomaly detector. A reference
implementation of Kitsune is provided at [9]. The anomaly detector of Kitsune
has a distinctive structure that consists of an ensemble of auto-encoders and
a preprocessing method called the feature mapper. In this paper, however, we
use only the feature extractor of Kitsune. Therefore, we do not explain the
details of its anomaly detector.
The feature extractor of Kitsune is designed to process arriving packets
successively without large memory consumption. Each captured packet provides a
timestamp, a packet size, MAC addresses, IP addresses and TCP/UDP ports. The
feature extractor uses this information about a given packet to calculate a
feature vector and to update the states used for the calculation.
The feature extractor maintains statistics called damped incremental
statistics. For a parameter $\lambda > 0$, an incremental statistic is a
3-tuple of real values denoted as $IS_\lambda = (w, LS, SS)$. Each incremental
statistic $IS_\lambda$ is associated with a data stream determined by MAC
addresses, IP addresses and TCP/UDP ports.
Each packet is in turn associated with several data streams. Data streams are
divided into the following 4 types.
– srcIP: the source IP address of a packet
– srcMAC-IP: (srcMAC, srcIP), the pair of the source MAC address and the
source IP address of a packet
– Channel: (srcIP, dstIP), the pair of the source and destination IP addresses
of a packet
– Socket: (srcIP, srcPort, dstIP, dstPort), the 4-tuple of IP addresses and
TCP/UDP ports used by a packet
Therefore, for each packet, the feature extractor updates 4 incremental
statistics. Each incremental statistic is initialized with zero values. Let
$x$ be the packet size of a given packet for the 3 types of incremental
statistics other than the Channel type, and let $x$ be a jitter value for the
Channel type, where the jitter value is defined as the difference between the
packet's timestamp and the timestamp of the last packet observed between the
same IP addresses. For a packet with a timestamp $t$, each incremental
statistic is updated as follows.
$$\gamma = 2^{-\lambda (t - t_{\mathrm{last}})}, \tag{1}$$
$$(w, LS, SS) \leftarrow (\gamma w + 1,\ \gamma LS + x,\ \gamma SS + x^2), \tag{2}$$
where $t_{\mathrm{last}}$ is the timestamp of the last packet related to the
same stream and $\leftarrow$ denotes an update of the variables on the
left-hand side. These updates can be done successively without keeping any
information about previously processed packets except the timestamp
$t_{\mathrm{last}}$. The parameter $\lambda$ determines the intensity of the
time decay applied by multiplying by $\gamma$. The feature extractor uses
multiple values of $\lambda$. It extracts features based on each of them,
concatenates these features, and outputs the concatenated feature vector.
From an incremental statistic, we obtain the statistics $\mu = LS/w$ and
$\sigma = \sqrt{|SS/w - (LS/w)^2|}$. They approximate the mean and the
standard deviation, respectively, of the values of $x$ observed over some
period. Since each of them depends on a single data stream, these features are
called 1D statistics in [5]. For Channel- and Socket-type streams, 4 other
kinds of statistics, called 2D statistics, are defined. They depend on 2 data
streams, for example 2 streams related to different source IP addresses, and
reflect characteristics such as the covariance and correlation between the 2
streams.
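The update rule of Eqs. (1) and (2) and the 1D statistics $\mu$ and $\sigma$ can be sketched for a single data stream as follows. This is a minimal illustration and not the reference implementation [9]: the class and attribute names are our own, the 2D statistics are omitted, and returning 0 for an empty stream is an assumed convention.

```python
import math

class DampedIncStat:
    """Damped incremental statistic IS_lambda = (w, LS, SS) for one data stream."""

    def __init__(self, lam):
        self.lam = lam      # decay parameter lambda > 0
        self.w = 0.0        # damped weight
        self.ls = 0.0       # damped linear sum of x
        self.ss = 0.0       # damped squared sum of x
        self.t_last = None  # timestamp of the last packet in this stream

    def update(self, x, t):
        """Apply Eqs. (1) and (2) for an observation x with timestamp t."""
        if self.t_last is not None:
            gamma = 2.0 ** (-self.lam * (t - self.t_last))
            self.w *= gamma
            self.ls *= gamma
            self.ss *= gamma
        # For the first packet the state is all zeros, so no decay is needed.
        self.w += 1.0
        self.ls += x
        self.ss += x * x
        self.t_last = t

    def mean(self):
        # Assumption: report 0 for a stream that has seen no packets yet.
        return self.ls / self.w if self.w > 0 else 0.0

    def std(self):
        if self.w == 0:
            return 0.0
        return math.sqrt(abs(self.ss / self.w - self.mean() ** 2))

# Two packets of size 100 and 200 bytes, one second apart, with lambda = 1.
stat = DampedIncStat(lam=1.0)
stat.update(100.0, t=0.0)
stat.update(200.0, t=1.0)
```

Only the three accumulators and the last timestamp are stored per stream, which is what allows the extractor to run successively over arriving packets without keeping past packets in memory.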
The feature extractor extracts 20 statistics from a packet for each $\lambda$.
The extracted statistics consist of $4 \times 3 = 12$ 1D statistics and
$2 \times 4 = 8$ 2D statistics. In [5], $\lambda = 5, 3, 1, 0.1, 0.01$ are
employed. The extracted feature vectors used for anomaly detection are
therefore $5 \times 20 = 100$-dimensional vectors consisting of 60 1D
statistics and 40 2D statistics.
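The dimension count above can be checked by enumerating the feature slots. In the sketch below, the feature names and their ordering are hypothetical and need not match the reference implementation [9]; we also assume that the three 1D statistics per stream are the weight, the mean and the standard deviation, and we use placeholder names for the four 2D statistics.

```python
from itertools import product

LAMBDAS = [5, 3, 1, 0.1, 0.01]  # decay parameters employed in [5]
STREAMS_1D = ["srcIP", "srcMAC-IP", "Channel", "Socket"]
STATS_1D = ["w", "mu", "sigma"]  # assumed 1D statistics per stream
STREAMS_2D = ["Channel", "Socket"]
STATS_2D = ["2d_stat_1", "2d_stat_2", "2d_stat_3", "2d_stat_4"]  # placeholders

# One hypothetical name per feature slot: stream, statistic and lambda.
names_1d = [f"{s}.{k}@{lam}" for lam, s, k in product(LAMBDAS, STREAMS_1D, STATS_1D)]
names_2d = [f"{s}.{k}@{lam}" for lam, s, k in product(LAMBDAS, STREAMS_2D, STATS_2D)]

print(len(names_1d), len(names_2d), len(names_1d) + len(names_2d))  # prints: 60 40 100
```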
4 Methodology
In this section, we propose a new packet classification method based on an NN
and Kitsune features.
Suppose we have a dataset consisting of packets and labels that indicate the
class to which each packet belongs. Using the relationship between packets and
labels in the dataset, we can construct a packet classifier by supervised
learning. An NN is a powerful model capable of learning such a relationship
between inputs and outputs. NN-based packet classifiers can adapt to various
characteristics of attacks and traffic by learning from appropriate data.
We propose using Kitsune features as inputs to an NN-based packet classifier.
Kitsune uses the features produced by its feature extractor for anomaly
detection, but they can also serve as inputs for NN-based packet
classification. Although Kitsune features are originally 100-dimensional
vectors, we use only the 60-dimensional subset consisting of the 1D
statistics, because extracting the 2D statistics takes too much computation
time for large datasets.
In this paper, we use a simple feed-forward NN as the classifier. Our
classifier has a 60-dimensional input layer and a softmax layer as its output
layer. The output for each input is regarded as a probability vector over the
set of classes to predict. Our classifier takes the argmax of the output
probability vector and outputs the corresponding class as its prediction.
Using Kitsune features as inputs to an NN has the following benefit: after
training, we can use the NN for online processing, because the feature
extraction itself is done in an online manner.
5 Experiment
In this section, we show experimental results on the open dataset
CSE-CIC-IDS2018 [2]. First, we introduce the dataset. Then, we describe the
setting of the experiments and show their results.
5.1 CSE-CIC-IDS2018 Dataset
The CSE-CIC-IDS2018 dataset [2, 6] is an open dataset of cyberattack traffic.
It is provided by the Communications Security Establishment (CSE) and the
Canadian Institute for Cybersecurity (CIC) and is distributed at [7].
The CSE-CIC-IDS2018 dataset was generated by simulating a network with benign
traffic and running attack tools against the simulated network. The dataset
contains the raw captured traffic for each day together with information about
the attacks. This information includes the kind of each attack, the period of
each attack, and the IP addresses of the attackers and targets of each attack.
Therefore, we can label the packets in the captured traffic by using the
attack information.
The dataset covers 10 days. The traffic was captured per day and per machine
in the network. Each day includes traffic from 1 to 3 kinds of attacks, and
all days include benign traffic. The periods of the attacks do not overlap.
The dataset contains 14 kinds of attacks in total. Therefore, including the
benign class, we use 15 classes in our experiments.
5.2 Labeling and Feature Extraction
We label the data by the following procedure. For each packet, we look at its
capture timestamp and IP addresses. If the timestamp falls within the period
of an attack and the packet was transmitted from the attackers to the targets
of that attack, we label the packet with the attack's name. If no attack
period contains the timestamp of the packet, or the packet was not sent from
attackers to targets, we label the packet "BENIGN".
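The labeling rule can be sketched as a small Python function. The `Attack` record, its field names, the inclusive treatment of the period boundaries, and the example addresses and attack name are assumptions made for illustration; the dataset's own attack schedule would supply the real values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attack:
    name: str             # attack class name used as the label
    start: float          # start of the attack period (epoch seconds)
    end: float            # end of the attack period (epoch seconds)
    attackers: frozenset  # attacker IP addresses
    targets: frozenset    # target IP addresses

def label_packet(timestamp, src_ip, dst_ip, attacks):
    """Return the attack name for attacker-to-target packets inside an attack
    period, and "BENIGN" otherwise."""
    for attack in attacks:
        in_period = attack.start <= timestamp <= attack.end  # assumed inclusive
        if in_period and src_ip in attack.attackers and dst_ip in attack.targets:
            return attack.name
    return "BENIGN"

# Hypothetical attack schedule with documentation-range IP addresses.
attacks = [
    Attack("ExampleAttack", 100.0, 200.0,
           frozenset({"192.0.2.10"}), frozenset({"198.51.100.5"})),
]
label = label_packet(150.0, "192.0.2.10", "198.51.100.5", attacks)
```

Note that, under this rule, replies sent from a target back to an attacker during the attack period are labeled "BENIGN", because only the attacker-to-target direction is treated as malicious.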
Since the number of benign packets is usually much larger than the number of
malicious packets, for each day we use only the packets captured at the target
IP addresses of the attacks to ease the label imbalance.
We use the reference implementation of the extractor in [9]. The extraction
was done in the temporal order of captures.
5.3 Experiments and Results
We performed two types of experiments. The first type uses the data of each
day separately. Since each day's data contain at most four classes, we
performed classification with at most four classes in this type of experiment.
We held out the last 20% of the packets in each attack period as the test
data. We used the remaining packets in the attack period, together with the
packets captured during the 30 minutes preceding the attack period, as the
training data. As an exception, for the data from 2018/02/21, we used only a
25% sub-sample of such training data, because the total number of packets on
this date was too large to use in our experiments. The sub-sampling was
stratified.
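The temporal split for one attack period can be sketched as follows. The function name and the tuple layout are hypothetical. We also assume that "the last 20%" is counted in packets (rounded down) in capture order; the text does not say whether the split was made by packet count or by time.

```python
def split_attack_period(packets, start, end, lead_seconds=30 * 60, test_fraction=0.2):
    """Split packets around one attack period into training and test data.

    packets: list of (timestamp, features, label) tuples sorted by timestamp.
    The last test_fraction of the packets inside [start, end] becomes the test
    data; the rest of that period plus the lead_seconds before it is training data.
    """
    in_attack = [p for p in packets if start <= p[0] <= end]
    before = [p for p in packets if start - lead_seconds <= p[0] < start]
    n_test = int(len(in_attack) * test_fraction)  # assumed: rounded down
    cut = len(in_attack) - n_test
    return before + in_attack[:cut], in_attack[cut:]

# Synthetic example: one packet per minute for an hour, attack from 2400 s to 2990 s.
packets = [(t, None, "x") for t in range(0, 3600, 60)]
train, test = split_attack_period(packets, start=2400, end=2990)
```

Keeping the test packets strictly later than the training packets avoids evaluating on traffic that precedes what the model was trained on.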
The second type is an experiment using the data from all days. We drew a
stratified sample of 20000 packets from each day's training data used in the
previous experiments and merged the samples into one sub-dataset, which we
call the mixture data. We performed 15-class classification with this mixture
data. In the test phase, we used the same test data as in the first type of
experiments.
In all experiments, we used NNs consisting of an input layer, 3 hidden layers
and a softmax layer. Each hidden layer has 16 units and uses the hyperbolic
tangent activation function. We implemented the NNs using TensorFlow and used
Adam [4] as the optimizer. The initial learning rate was 0.001 and the
training batch size was 1024. We stopped training early when the validation
loss had not improved on its minimum for 5 epochs.
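A network with the stated shape could be written with the Keras API of TensorFlow roughly as below. This is a sketch under assumptions rather than the code used in the experiments: the function name, the default weight initialization, the maximum number of epochs in the commented `fit` call and the placeholder array names are our own.

```python
import tensorflow as tf

def build_classifier(num_features=60, num_classes=15, hidden_units=16):
    """Feed-forward classifier: 3 tanh hidden layers followed by a softmax layer."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(num_features,)),
        tf.keras.layers.Dense(hidden_units, activation="tanh"),
        tf.keras.layers.Dense(hidden_units, activation="tanh"),
        tf.keras.layers.Dense(hidden_units, activation="tanh"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",  # expects one-hot encoded labels
        metrics=["accuracy"],
    )
    return model

# Stop when the validation loss has not reached a new minimum for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

model = build_classifier()
# Training call with placeholder arrays; epochs=100 is an assumed upper bound.
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=1024, epochs=100, callbacks=[early_stop])
```

For the per-day experiments, `num_classes` would be set to the number of classes present on that day (at most four) instead of 15.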
In each experiment, we repeated the following procedure 5 times and calculated
the mean values of the resulting evaluation metrics. First, we initialize a
model. Then, we randomly split the training data into train and validation
sets with a ratio of 3:1. Finally, we train the model and evaluate it with the
test data. All features are standardized based on the train set, both in
training and in evaluation.
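One way to realize the 3:1 split and the train-set-based standardization is sketched below. We assume that standardization means subtracting the per-feature mean and dividing by the per-feature standard deviation of the train set; the function names and the handling of constant features are our own choices.

```python
import random
import statistics

def split_train_val(samples, val_fraction=0.25, seed=None):
    """Randomly split samples into train and validation lists (3:1 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def fit_standardizer(rows):
    """Compute the per-feature mean and standard deviation from the train set only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Assumption: constant features get a divisor of 1 to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply the train-set statistics to any set of rows (train, validation or test)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy data with two features; the second feature is constant in the train set.
train_rows = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_standardizer(train_rows)
test_rows = standardize([[5.0, 12.0]], means, stds)
```

Fitting the statistics on the train set alone keeps information from the validation and test data out of the preprocessing step.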
The metrics we employed to evaluate a classifier were precision, recall and
F-measure. Let $T$ be a test set of pairs of a feature vector and a label. Let
$p(x)$ be the class predicted by the classifier for the input feature vector
$x$. The metrics are defined for each class $c$ as
$$\mathrm{precision}(c) = \frac{TP(c)}{TP(c) + FP(c)}, \quad \mathrm{recall}(c) = \frac{TP(c)}{TP(c) + FN(c)}, \quad F(c) = \frac{2\,\mathrm{precision}(c) \cdot \mathrm{recall}(c)}{\mathrm{precision}(c) + \mathrm{recall}(c)},$$
where
$$TP(c) = \sum_{(x,y) \in T} I(p(x) = c,\ y = c), \quad TN(c) = \sum_{(x,y) \in T} I(p(x) \neq c,\ y \neq c),$$
$$FP(c) = \sum_{(x,y) \in T} I(p(x) = c,\ y \neq c), \quad FN(c) = \sum_{(x,y) \in T} I(p(x) \neq c,\ y = c),$$
and $I(\cdot)$ is the indicator function. All of these metrics take values in
$[0, 1]$, and larger values mean better performance.
Fig. 1: Experimental results per day
Fig. 2: Experimental results for the mixture data
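The per-class precision, recall and F-measure defined in this section can be computed directly from the lists of true and predicted labels, as in the sketch below. The function name and the toy labels are illustrative, and returning 0 when a denominator is zero is an assumed convention that the definitions themselves leave open.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, F-measure) of class cls."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == cls and t == cls)
    fp = sum(1 for t, p in pairs if p == cls and t != cls)
    fn = sum(1 for t, p in pairs if p != cls and t == cls)
    # Assumed convention: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Toy labels; "A" and "B" stand in for attack class names.
y_true = ["BENIGN", "A", "A", "B", "BENIGN", "A"]
y_pred = ["BENIGN", "A", "B", "B", "A", "A"]
precision_a, recall_a, f_a = per_class_metrics(y_true, y_pred, "A")
precision_b, recall_b, f_b = per_class_metrics(y_true, y_pred, "B")
```

In this toy example, class "A" has 2 true positives, 1 false positive and 1 false negative, so its precision, recall and F-measure are all 2/3.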
We show the results of the per-day experiments in Fig. 1. The horizontal axis
shows the class names and the dates. The vertical axis shows the mean value of
each metric. We also show the results of the experiment using the mixture data
in Fig. 2.
6 Discussion
Fig. 1 shows that our per-day classifiers perform well except for the classes
"Infiltration" and "SQL-Injection". However, Fig. 2 shows that our classifier
for the mixture data performs worse for some classes than the classifiers
for each day. In particular, the performance for the DDoS-LOIC-HTTP and
DoS-SlowHTTPTest classes decreases significantly. This implies that our
classifier may fail to classify these classes correctly in practical
situations. Since Kitsune features were originally designed for anomaly
detection, they may not contain sufficient information to discriminate some
attacks.
7 Conclusion
We proposed a new NN-based packet classifier that uses Kitsune features as
inputs. We evaluated the proposed classifier in experiments using the
CSE-CIC-IDS2018 open dataset. Our experiments show that Kitsune 1D features
can be used for packet classification with reasonable performance for many
kinds of attacks. However, they also show that the performance degrades when a
large number of attack classes must be discriminated.
Acknowledgments
This research was conducted under a contract of "MITIGATE" among "Research
and Development for Expansion of Radio Wave Resources (JPJ000254)," which
was supported by the Ministry of Internal Affairs and Communications, Japan.
References
1. Hwang, R.H., Peng, M.C., Nguyen, V.L., Chang, Y.L.: An LSTM-based deep
learning approach for classifying malicious traffic at the packet level.
Applied Sciences 9(16) (2019)
2. Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: Toward generating a new
intrusion detection dataset and intrusion traffic characterization. In: 4th
International Conference on Information Systems Security and Privacy (ICISSP)
(Jan 2018)
3. Ishibashi, R., Goto, H., Han, C., Ban, T., Takahashi, T., Takeuchi, J.:
Which packet did they catch? Associating NIDS alerts with their communication
sessions. In: The 16th Asia Joint Conference on Information Security (Aug 2021)
4. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980 (Dec 2014)
5. Mirsky, Y., Doitshman, T., Elovici, Y., Shabtai, A.: Kitsune: An ensemble of
autoencoders for online network intrusion detection. In: Network and
Distributed System Security Symposium 2018 (Feb 2018)
6. Online: CSE-CIC-IDS2018 on AWS. https://www.unb.ca/cic/datasets/ids-2018.html
(last visited on 2021-12-31)
7. Online: A realistic cyber defense dataset (CSE-CIC-IDS2018).
https://registry.opendata.aws/cse-cic-ids2018/ (last visited on 2021-12-31)
8. Takahashi, T., Umemura, Y., Han, C., Ban, T., Furumoto, K., Nakamura, O.,
Yoshioka, K., Takeuchi, J., Murata, N., Shiraishi, Y.: Designing comprehensive
cyber threat analysis platform: Can we orchestrate analysis engines? In: 2021
IEEE International Conference on Pervasive Computing and Communications
Workshops and other Affiliated Events (PerCom Workshops) (2021)
9. ymirsky: Kitsune-py. https://github.com/ymirsky/Kitsune-py (last visited on
2021-12-31)